
This notebook shows how to apply scikit-network to content recommendation.

We use the Movielens dataset, available in the netset collection, which contains the ratings given by 671 users to 9066 movies.

from IPython.display import SVG
import numpy as np
from scipy.cluster.hierarchy import linkage
from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.embedding import Spectral
from sknetwork.utils import get_neighbors
from sknetwork.visualization import visualize_dendrogram


dataset = load_netset('movielens')
Downloading movielens from NetSet...
Unpacking archive...
Parsing files...
biadjacency = dataset.biadjacency
names = dataset.names
labels = dataset.labels
names_labels = dataset.names_labels
<9066x671 sparse matrix of type '<class 'numpy.float64'>'
        with 100004 stored elements in Compressed Sparse Row format>
n_movies, n_users = biadjacency.shape
# ratings
np.unique(biadjacency.data, return_counts=True)
(array([0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ]),
 array([ 1101,  3326,  1687,  7271,  4449, 20064, 10538, 28750,  7723,
        15095]))
# positive ratings
positive = biadjacency >= 3
<9066x671 sparse matrix of type '<class 'numpy.bool_'>'
        with 82170 stored elements in Compressed Sparse Row format>
array(['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'], dtype='<U11')
(9066, 19)
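
The thresholding step can be tried on a toy matrix. The sketch below uses a small hand-made ratings matrix (the values are made up, not taken from Movielens) and assumes SciPy's CSR format, which is what the outputs above report. The names `ratings` and `toy_positive` are illustrative only.

```python
import numpy as np
from scipy import sparse

# Toy ratings: 4 movies (rows) x 3 users (columns); 0 means "no rating".
# These values are made up for illustration.
ratings = sparse.csr_matrix(np.array([
    [5.0, 0.0, 2.5],
    [0.0, 3.0, 0.0],
    [1.0, 4.5, 0.0],
    [0.0, 0.0, 3.5],
]))

# Keep only ratings of at least 3; the result is a boolean sparse matrix
# whose stored entries are the "positive" ratings.
toy_positive = ratings >= 3

n_ratings = ratings.nnz                        # number of stored ratings
n_positive = int(toy_positive.toarray().sum()) # number of positive ratings
```

Only the stored entries are compared with the threshold, so unrated pairs stay absent from the result instead of being counted as negative ratings.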


We first use (personalized) PageRank to get the most popular movies of each category.

pagerank = PageRank()
# top-10 movies
scores = pagerank.fit_predict(positive)
names[top_k(scores, 10)]
array(['Forrest Gump (1994)', 'Pulp Fiction (1994)',
       'Shawshank Redemption, The (1994)',
       'Silence of the Lambs, The (1991)',
       'Star Wars: Episode IV - A New Hope (1977)', 'Matrix, The (1999)',
       'Jurassic Park (1993)', "Schindler's List (1993)",
       'Back to the Future (1985)',
       'Star Wars: Episode V - The Empire Strikes Back (1980)'],
      dtype=object)
# number of movies per genre
n_selection = 10
# selection
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_predict(positive, weights=labels[:, label])
    scores = ppr * labels[:, label]
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
# show selection (some movies may have several genres)
for label, name_label in enumerate(names_labels):
    print(label, name_label)
    print(names[selection[label, :5]])
0 Action
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
 'Jurassic Park (1993)'
 'Star Wars: Episode V - The Empire Strikes Back (1980)'
 'Terminator 2: Judgment Day (1991)']
1 Adventure
['Star Wars: Episode IV - A New Hope (1977)' 'Jurassic Park (1993)'
 'Star Wars: Episode V - The Empire Strikes Back (1980)'
 'Back to the Future (1985)' 'Toy Story (1995)']
2 Animation
['SpongeBob SquarePants Movie, The (2004)' 'Tangled Ever After (2012)'
 'Space Chimps (2008)' 'Pokémon 3: The Movie (2001)' 'Valiant (2005)']
3 Children
['Thomas and the Magic Railroad (2000)' 'Smurfs 2, The (2013)'
 'Like Mike (2002)' 'Hey Arnold! The Movie (2002)'
 'Race to Witch Mountain (2009)']
4 Comedy
['Forrest Gump (1994)' 'Pulp Fiction (1994)' 'Back to the Future (1985)'
 'Toy Story (1995)' 'Fargo (1996)']
5 Crime
['Pulp Fiction (1994)' 'Shawshank Redemption, The (1994)'
 'Silence of the Lambs, The (1991)' 'Fargo (1996)' 'Godfather, The (1972)']
6 Documentary
['SOMM: Into the Bottle (2016)' 'Cocaine Cowboys: Reloaded (2014)'
 "Cocaine Cowboys II: Hustlin' With the Godmother (2008)"
 'Agony and the Ecstasy of Phil Spector, The (2009)' 'Promises (2001)']
7 Drama
['Pulp Fiction (1994)' 'Forrest Gump (1994)'
 'Shawshank Redemption, The (1994)' "Schindler's List (1993)"
 'American Beauty (1999)']
8 Fantasy
['Twilight Saga: Eclipse, The (2010)' 'Fat Albert (2004)'
 'Nightbreed (1990)' 'Beastmaster 2: Through the Portal of Time (1991)'
 'Solace (2015)']
9 Film-Noir
['Kiss Before Dying, A (1956)' 'T-Men (1947)' 'No Way Out (1950)'
 'Force of Evil (1948)' 'Bullet to the Head (2012)']
10 Horror
['Silence of the Lambs, The (1991)' 'Rogue (2007)'
 'Paranormal Activity: The Marked Ones (2014)' 'Ring of Terror (1962)'
 'Carnosaur 3: Primal Species (1996)']
11 IMAX
['Jack the Giant Slayer (2013)' "Dr. Seuss' The Lorax (2012)"
 'After Earth (2013)' 'Resident Evil: Retribution (2012)'
 'Mars Needs Moms (2011)']
12 Musical
['First Nudie Musical, The (1976)' 'Zoot Suit (1981)' 'Yentl (1983)'
 "Dr. Seuss' The Lorax (2012)" 'Singing Detective, The (2003)']
13 Mystery
['Spirits of the Dead (1968)' 'Oscar (1991)' 'Solace (2015)'
 'Nomads (1986)'
 'Adventures of Mary-Kate and Ashley, The: The Case of the United States Navy Adventure (1997)']
14 Romance
['Forrest Gump (1994)' 'American Beauty (1999)'
 'Princess Bride, The (1987)' 'Beauty and the Beast (1991)'
 'Good Will Hunting (1997)']
15 Sci-Fi
['Star Wars: Episode IV - A New Hope (1977)' 'Matrix, The (1999)'
 'Star Wars: Episode V - The Empire Strikes Back (1980)'
 'Jurassic Park (1993)' 'Back to the Future (1985)']
16 Thriller
['Pulp Fiction (1994)' 'Silence of the Lambs, The (1991)'
 'Matrix, The (1999)' 'Jurassic Park (1993)' 'Fargo (1996)']
17 War
['Iron Eagle II (1988)' 'Dark Blue World (Tmavomodrý svet) (2001)'
 'Wind That Shakes the Barley, The (2006)' 'Pathfinder (2007)'
 'Night of the Generals, The (1967)']
18 Western
['The Ridiculous 6 (2015)' 'Shakiest Gun in the West, The (1968)'
 "'Neath the Arizona Skies (1934)" 'Stagecoach (1966)'
 'Missing, The (2003)']
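
To make the per-genre selection more concrete, here is a minimal sketch of personalized PageRank by power iteration on a toy bipartite graph. It is written under stated assumptions: dense NumPy arrays, a damping factor of 0.85, and a random walk that alternates between movies and users and restarts on the seed movies. The helper `personalized_pagerank` and the toy data are hypothetical, and this is not necessarily how scikit-network implements `PageRank`.

```python
import numpy as np

def personalized_pagerank(biadjacency, seeds, damping=0.85, n_iter=100):
    """Power iteration on the bipartite graph movies <-> users.

    `seeds` is a non-negative vector over movies; the walk restarts on it.
    Hypothetical helper for illustration only.
    """
    B = np.asarray(biadjacency, dtype=float)
    n_movies, n_users = B.shape
    # Symmetric adjacency of the bipartite graph: movies first, then users.
    A = np.block([[np.zeros((n_movies, n_movies)), B],
                  [B.T, np.zeros((n_users, n_users))]])
    degrees = A.sum(axis=1)
    degrees[degrees == 0] = 1.0  # avoid division by zero for isolated nodes
    P = A / degrees[:, None]     # row-normalized transition matrix
    restart = np.concatenate([seeds, np.zeros(n_users)]).astype(float)
    restart /= restart.sum()
    scores = restart.copy()
    for _ in range(n_iter):
        scores = damping * P.T @ scores + (1 - damping) * restart
    return scores[:n_movies]     # keep the movie scores only

# Toy data (made up): 5 movies x 4 users, 1 = positive rating.
toy = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])
# Genre membership of each movie (made up): movies 2 and 3 belong to the genre.
genre = np.array([0, 0, 1, 1, 0])

ppr = personalized_pagerank(toy, seeds=genre)
# As in the loop above, multiply by the indicator so that only movies
# of the genre are ranked.
genre_scores = ppr * genre
best = np.argsort(-genre_scores)[:2]
```

Multiplying by the genre indicator matters because the walk also reaches movies outside the genre through shared users. Without it, a popular movie of another genre could enter the selection.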

We now apply PageRank to get the movies most relevant to a given movie.

target = {i: name for i, name in enumerate(names) if 'Cherbourg' in name}
{175: 'Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)'}
scores_ppr = pagerank.fit_predict(positive, weights={175:1})
names[top_k(scores_ppr - scores, 10)]
array(['Umbrellas of Cherbourg, The (Parapluies de Cherbourg, Les) (1964)',
       'Fargo (1996)', 'Pulp Fiction (1994)',
       'Star Wars: Episode IV - A New Hope (1977)',
       'L.A. Confidential (1997)', 'Matrix, The (1999)',
       'Shawshank Redemption, The (1994)', 'American Beauty (1999)',
       'Clockwork Orange, A (1971)', 'Jurassic Park (1993)'], dtype=object)
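
Ranking by the difference between two score vectors, as in the cell above, favors movies whose score increases under personalization over movies that are simply popular. The sketch below illustrates this with made-up score vectors. The helper `top_k_indices` is a hypothetical NumPy stand-in written for this example, not the implementation of `top_k` in scikit-network.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, in decreasing order (hypothetical helper)."""
    scores = np.asarray(scores)
    k = min(k, len(scores))
    candidates = np.argpartition(-scores, k - 1)[:k]    # unordered top k
    return candidates[np.argsort(-scores[candidates])]  # sort the k winners

# Made-up scores for 5 movies.
base = np.array([0.40, 0.30, 0.15, 0.10, 0.05])          # global scores
personalized = np.array([0.35, 0.20, 0.22, 0.05, 0.18])  # scores seeded on movie 4

by_score = top_k_indices(personalized, 2)        # dominated by popular movies
by_gain = top_k_indices(personalized - base, 2)  # movies that gain the most
```

On this toy input, movie 0 leads the raw personalized ranking because it is globally popular, while the ranking by gain puts the seed movie 4 first.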

We can also apply PageRank to recommend movies to a user.

user = 1
targets = get_neighbors(positive, user, transpose=True)
# seen movies (sample)
names[targets[:10]]
array(['GoldenEye (1995)', 'Sense and Sensibility (1995)',
       'Clueless (1995)', 'Seven (a.k.a. Se7en) (1995)',
       'Usual Suspects, The (1995)', 'Mighty Aphrodite (1995)',
       "Mr. Holland's Opus (1995)", 'Braveheart (1995)',
       'Brothers McMullen, The (1995)', 'Apollo 13 (1995)'], dtype=object)
mask = np.zeros(len(names), dtype=bool)
mask[targets] = 1
scores_ppr = pagerank.fit_predict(positive, weights=mask)
# top-10 recommendation
names[top_k((scores_ppr - scores) * (1 - mask), 10)]
array(['Shawshank Redemption, The (1994)', 'True Lies (1994)',
       'Star Wars: Episode IV - A New Hope (1977)',
       'Beauty and the Beast (1991)', 'Toy Story (1995)',
       'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)', 'Fargo (1996)',
       'Independence Day (a.k.a. ID4) (1996)', 'Matrix, The (1999)',
       'Star Wars: Episode V - The Empire Strikes Back (1980)'],
      dtype=object)
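
The role of the mask in the recommendation step can be checked on a toy example. The sketch below uses made-up scores and compares two ways of excluding the movies already seen: multiplying the gain by `1 - mask`, as above, and setting the gain of seen movies to minus infinity. The second variant is an alternative suggested here, not something the notebook uses.

```python
import numpy as np

# Made-up scores for 6 movies.
base = np.array([0.30, 0.25, 0.20, 0.10, 0.10, 0.05])          # global scores
personalized = np.array([0.34, 0.20, 0.18, 0.16, 0.04, 0.08])  # seeded on seen movies
seen = np.array([True, False, False, True, False, False])

gain = personalized - base

# Variant used above: the gain of seen movies is set to zero.
zeroed = gain * (1 - seen)
# Alternative: seen movies get -inf, so they can never be selected.
excluded = np.where(seen, -np.inf, gain)

top_zeroed = np.argsort(-zeroed)[:3]
top_excluded = np.argsort(-excluded)[:3]
```

In this toy case only one unseen movie has a positive gain, so the zeroed variant fills the remaining slots with seen movies (their score of 0 beats the negative gains), whereas the `-inf` variant returns unseen movies only. With many candidates of positive gain, as in the output above, both variants would be expected to agree.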


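We now represent each movie by a vector in low dimension, and use hierarchical clustering to visualize the structure of this embedding for the top-100 movies.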

# embedding
spectral = Spectral(10)
embedding = spectral.fit_transform(positive)
# top-100 movies
scores = pagerank.fit_predict(positive)
index = top_k(scores, 100)
dendrogram = linkage(embedding[index], method='ward')
# visualization
image = visualize_dendrogram(dendrogram, names=names[index], rotate=True, width=200, height=1000, n_clusters=6)
SVG(image)
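
The clustering step relies only on SciPy and can be explored without the dataset. The sketch below applies Ward linkage to a made-up 2-D embedding with three well-separated groups, then cuts the tree into flat clusters with `fcluster`. The toy data and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Made-up 2-D "embedding": three well-separated groups of 5 points each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_embedding = np.vstack([c + 0.1 * rng.standard_normal((5, 2)) for c in centers])

# Ward linkage returns an (n - 1) x 4 array describing the successive merges.
toy_dendrogram = linkage(toy_embedding, method='ward')

# Cut the tree into 3 flat clusters (labels start at 1).
toy_clusters = fcluster(toy_dendrogram, t=3, criterion='maxclust')
```

The `n_clusters=6` argument in the visualization above plays a similar role to `t=3` here: it sets how many groups are distinguished when the dendrogram is drawn.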
