Text mining

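We show how to use scikit-network for text mining. We consider Victor Hugo's novel Les Misérables (Project Gutenberg, translated by Isabel F. Hapgood). By building the bipartite graph between words and paragraphs, we can embed both words and paragraphs in the same vector space and compute cosine similarities between them.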

Each word is taken as it appears in the text; a more advanced tokenizer could be used instead.
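
As an illustration of what a more advanced tokenizer might look like, here is a minimal sketch based on a regular expression. The `tokenize` helper and its pattern are assumptions made for this example; they are not part of scikit-network or of this notebook. The pattern keeps accented letters and treats inner apostrophes and hyphens as part of a word.

```python
import re

# Hypothetical pattern (an assumption for this sketch): one or more letters,
# optionally followed by groups of "'" or "-" and more letters.
# [^\W\d_] matches any Unicode letter, so accented characters are kept.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(paragraph):
    """Return the lowercase tokens of a paragraph, dropping punctuation."""
    return TOKEN_PATTERN.findall(paragraph.lower())


tokens = tokenize("They reached the barrier of l'Étoile at five o'clock, half-asleep.")
print(tokens)
```

With such a tokenizer, `l'étoile` and `o'clock` stay single tokens, whereas the punctuation removal applied in this notebook splits them into `l étoile` and `o clock`.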

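Other graphs can be considered, such as the graph of co-occurrence of words within a window of 5 words, or the graph of chapters and words. These graphs can be combined to obtain richer information and better embeddings.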
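
The co-occurrence graph mentioned above could be sketched as follows, using only the standard library. `cooccurrence_counts` is a hypothetical helper, and reading "a window of 5 words" as five consecutive tokens is an assumption; other conventions (for instance a symmetric window around each word) are equally possible.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=5):
    """Count unordered pairs of distinct words found within `window` consecutive tokens."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # The window starting at position i covers tokens i, i+1, ..., i+window-1.
        for other in tokens[i + 1:i + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


tokens = 'on sunday fatigue does not work'.split()
counts = cooccurrence_counts(tokens, window=3)
for pair, weight in sorted(counts.items()):
    print(pair, weight)
```

Each pair and its count can then be read as a weighted edge between two words, which gives a word-word graph instead of the paragraph-word graph built in this notebook.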

[1]:
from re import sub
[2]:
import numpy as np
[3]:
from sknetwork.data import from_adjacency_list
from sknetwork.embedding import Spectral

Load data

[4]:
filename = 'miserables-en.txt'
[5]:
with open(filename, 'r', encoding='utf-8') as f:
    text = f.read()
[6]:
len(text)
[6]:
3254333
[7]:
print(text[:494])
The Project Gutenberg EBook of Les Misérables, by Victor Hugo

This eBook is for the use of anyone anywhere at no cost and with almost
no restrictions whatsoever. You may copy it, give it away or re-use
it under the terms of the Project Gutenberg License included with this
eBook or online at www.gutenberg.org


Title: Les Misérables
       Complete in Five Volumes

Author: Victor Hugo

Translator: Isabel F. Hapgood

Release Date: June 22, 2008 [EBook #135]
Last Updated: January 18, 2016


Preprocessing

[8]:
# extract main text
main = text.split('LES MISÉRABLES')[-2].lower()
[9]:
len(main)
[9]:
3215017
[10]:
# remove punctuation
main = sub(r"[,.;:()@#?!&$'_*]", " ", main)
main = sub(r'["-]', ' ', main)
[11]:
# extract paragraphs
sep = '|||'
main = sub(r'\n\n+', sep, main)
main = sub('\n', ' ', main)
paragraphs = main.split(sep)
[12]:
len(paragraphs)
[12]:
13499
[13]:
paragraphs[1000]
[13]:
'after leaving the asses there was a fresh delight  they crossed the seine in a boat  and proceeding from passy on foot they reached the barrier of l étoile  they had been up since five o clock that morning  as the reader will remember  but  bah  there is no such thing as fatigue on sunday   said favourite   on sunday fatigue does not work  '

Build graph

[14]:
paragraph_words = [paragraph.split(' ') for paragraph in paragraphs]
[15]:
graph = from_adjacency_list(paragraph_words, bipartite=True)
[16]:
biadjacency = graph.biadjacency
words = graph.names_col
[17]:
biadjacency
[17]:
<13499x23093 sparse matrix of type '<class 'numpy.int64'>'
        with 416331 stored elements in Compressed Sparse Row format>
[18]:
len(words)
[18]:
23093
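
To make the structure of the biadjacency matrix concrete, the following sketch builds a small count matrix by hand with NumPy. It is not how `from_adjacency_list` is implemented; `build_biadjacency` is a hypothetical helper that assumes repeated words in a paragraph are summed into counts, and it uses a dense array only because the example is tiny.

```python
import numpy as np


def build_biadjacency(paragraph_words):
    """Return (count matrix, vocabulary) for a list of tokenized paragraphs."""
    vocabulary = sorted({word for tokens in paragraph_words for word in tokens})
    index = {word: j for j, word in enumerate(vocabulary)}
    # Rows are paragraphs, columns are words, entries are occurrence counts.
    matrix = np.zeros((len(paragraph_words), len(vocabulary)), dtype=int)
    for i, tokens in enumerate(paragraph_words):
        for word in tokens:
            matrix[i, index[word]] += 1
    return matrix, np.array(vocabulary)


toy_paragraphs = [['the', 'seine', 'the', 'boat'], ['the', 'barrier']]
toy_matrix, toy_words = build_biadjacency(toy_paragraphs)
print(toy_words)
print(toy_matrix)
```

For the full novel a sparse format is preferable, as most paragraphs contain only a small fraction of the vocabulary.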

Statistics

[19]:
n_row, n_col = biadjacency.shape
[20]:
paragraph_lengths = biadjacency.dot(np.ones(n_col))
[21]:
np.quantile(paragraph_lengths, [0.1, 0.5, 0.9, 0.99])
[21]:
array([  6.,  23., 127., 379.])
[22]:
word_counts = biadjacency.T.dot(np.ones(n_row))
[23]:
np.quantile(word_counts, [0.1, 0.5, 0.9, 0.99])
[23]:
array([  1.  ,   2.  ,  23.  , 282.08])
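
Multiplying the matrix by a vector of ones sums its rows, and multiplying its transpose by a vector of ones sums its columns. The toy matrix below is made up for illustration; it shows that these products give the number of tokens per paragraph and the number of occurrences per word.

```python
import numpy as np

# Made-up count matrix: 3 paragraphs (rows) x 4 words (columns).
toy = np.array([[2, 1, 0, 0],
                [0, 1, 1, 0],
                [1, 0, 0, 3]])
n_row, n_col = toy.shape

paragraph_lengths = toy.dot(np.ones(n_col))  # row sums: tokens per paragraph
word_counts = toy.T.dot(np.ones(n_row))      # column sums: occurrences per word

print(paragraph_lengths)
print(word_counts)
print(np.quantile(word_counts, 0.5))  # median number of occurrences per word
```

In the results above, the median word count of 2 against a 99th percentile of about 282 points to a heavy-tailed distribution: most words are rare and a few are very frequent.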

Embedding

[24]:
dimension = 50
spectral = Spectral(dimension, regularization=100)
[25]:
spectral.fit(biadjacency)
[25]:
Spectral(n_components=50, decomposition='rw', regularization=100, normalized=True)
[26]:
embedding_paragraph = spectral.embedding_row_
embedding_word = spectral.embedding_col_
[27]:
# index of a query word
i = int(np.flatnonzero(words == 'love')[0])
[28]:
# most similar words
cosines_word = embedding_word.dot(embedding_word[i])
words[np.argsort(-cosines_word)[:20]]
[28]:
array(['love', 'kiss', 'ye', 'celestial', 'hearts', 'loved', 'tender',
       'roses', 'joys', 'sweet', 'wedded', 'charming', 'angelic', 'adore',
       'aurora', 'pearl', 'voluptuousness', 'chaste', 'innumerable',
       'heart'], dtype='<U21')
[29]:
np.quantile(cosines_word, [0.01, 0.1, 0.5, 0.9, 0.99])
[29]:
array([-0.24307366, -0.14047851, -0.02607974,  0.14319717,  0.42843234])
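
The dot products above can be read as cosine similarities because the embedding was fitted with `normalized=True`. The sketch below normalizes the vectors explicitly so that it does not rely on that setting; `most_similar` is a hypothetical helper and the toy vectors are made up for illustration.

```python
import numpy as np


def most_similar(embedding, index, k=5):
    """Return the indices of the k rows closest to row `index`, and all cosines."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    unit = embedding / np.maximum(norms, 1e-12)  # guard against zero vectors
    cosines = unit.dot(unit[index])
    return np.argsort(-cosines)[:k], cosines


toy_embedding = np.array([[1.0, 0.0],
                          [2.0, 0.1],
                          [0.0, 3.0],
                          [-1.0, 0.0]])
top, cosines = most_similar(toy_embedding, index=0, k=2)
print(top)
print(cosines)
```

The query row always ranks first with a cosine of 1, which is why `love` heads the list of words most similar to `love`.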
[30]:
# some paragraph
i = 1000
print(paragraphs[i])
after leaving the asses there was a fresh delight  they crossed the seine in a boat  and proceeding from passy on foot they reached the barrier of l étoile  they had been up since five o clock that morning  as the reader will remember  but  bah  there is no such thing as fatigue on sunday   said favourite   on sunday fatigue does not work
[31]:
# most similar paragraphs
cosines_paragraph = embedding_paragraph.dot(embedding_paragraph[i])
for j in np.argsort(-cosines_paragraph)[:3]:
    print(paragraphs[j])
    print()
after leaving the asses there was a fresh delight  they crossed the seine in a boat  and proceeding from passy on foot they reached the barrier of l étoile  they had been up since five o clock that morning  as the reader will remember  but  bah  there is no such thing as fatigue on sunday   said favourite   on sunday fatigue does not work

he was a man of lofty stature  half peasant  half artisan  he wore a huge leather apron  which reached to his left shoulder  and which a hammer  a red handkerchief  a powder horn  and all sorts of objects which were upheld by the girdle  as in a pocket  caused to bulge out  he carried his head thrown backwards  his shirt  widely opened and turned back  displayed his bull neck  white and bare  he had thick eyelashes  enormous black whiskers  prominent eyes  the lower part of his face like a snout  and besides all this  that air of being on his own ground  which is indescribable

this was the state which the shepherd idyl  begun at five o clock in the morning  had reached at half past four in the afternoon  the sun was setting  their appetites were satisfied

[32]:
np.quantile(cosines_paragraph, [0.01, 0.1, 0.5, 0.9, 0.99])

[32]:
array([-0.30671191, -0.17309593, -0.00319729,  0.21574375,  0.45969887])
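
Since words and paragraphs are embedded in the same vector space, a word can also be compared directly with paragraphs. The following sketch shows one way this might be done; `closest_paragraphs` is a hypothetical helper and the toy arrays stand in for `embedding_word` and `embedding_paragraph`.

```python
import numpy as np


def closest_paragraphs(embedding_word, embedding_paragraph, word_index, k=3):
    """Return the indices of the k paragraphs with highest cosine to a given word."""
    query = embedding_word[word_index]
    query = query / np.linalg.norm(query)
    norms = np.linalg.norm(embedding_paragraph, axis=1)
    cosines = embedding_paragraph.dot(query) / np.maximum(norms, 1e-12)
    return np.argsort(-cosines)[:k]


# Made-up 2-dimensional embeddings: 2 words and 3 paragraphs.
toy_words = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
toy_paragraphs = np.array([[0.9, 0.1],
                           [0.1, 0.9],
                           [0.7, 0.7]])
best = closest_paragraphs(toy_words, toy_paragraphs, word_index=0, k=2)
print(best)
```

On the real data, the same call with the index of `love` would return the paragraphs whose embeddings point in the direction closest to that word.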