Exploring Anime Recommendation System Approaches

- Introduction
- Data
- Baseline Model & Evaluation
- Content Based Filtering
- Collaborative Filtering
- Hybrid Model
- Conclusion
1. Introduction
In this notebook we will be exploring recommendation systems using 3 different approaches, applying them to data we have scraped from a popular anime database/community website.
The three different approaches we will be exploring are:
- Content Based Filtering
- Collaborative Filtering
- Hybrid (a combination of the first two approaches)
For the sake of comparison and evaluation, there will also be a baseline model that provides recommendations based on the most popular titles that users have not interacted with.
2. Data
We will be using scraped datasets that we have obtained previously. There are 3 separate datasets that provide us with the data we need to create the recommendation systems:
- cleaned_anime_info.csv
- cleaned_anime_reviews.csv
- cleaned_user_ratings.csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
from ast import literal_eval
from collections import defaultdict
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from scipy.sparse.linalg import svds
from scipy.sparse import hstack
from sklearn.preprocessing import StandardScaler, LabelEncoder, MultiLabelBinarizer
from multiprocessing.pool import ThreadPool as Pool
from datetime import datetime
import random
import warnings
warnings.filterwarnings('ignore')
Content Data From Anime Titles
df_info = pd.read_csv('cleaned_anime_info.csv')
df_info.head()
MAL_Id | Name | Type | Episodes | Status | Producers | Licensors | Studios | Source | Genres | ... | Score-2 | Score-1 | Synopsis | Voice_Actors | Recommended_Ids | Recommended_Counts | Aired_Start | Aired_End | Premiered_Season | Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52991 | Sousou no Frieren | TV | 28.0 | Finished Airing | ['Aniplex', 'Dentsu', 'Shogakukan-Shueisha Pro... | ['None found', 'add some'] | ['Madhouse'] | Manga | ['Adventure', 'Drama', 'Fantasy', 'Shounen'] | ... | 402 | 4100 | During their decade-long quest to defeat the D... | ['Tanezaki, Atsumi', 'Ichinose, Kana', 'Kobaya... | ['33352', '41025', '35851', '486', '457', '296... | ['14', '11', '8', '5', '5', '4', '4', '3', '2'... | 2023-09-29 | 2024-03-22 | 4.0 | 1 |
1 | 5114 | Fullmetal Alchemist: Brotherhood | TV | 64.0 | Finished Airing | ['Aniplex', 'Square Enix', 'Mainichi Broadcast... | ['Funimation', 'Aniplex of America'] | ['Bones'] | Manga | ['Action', 'Adventure', 'Drama', 'Fantasy', 'M... | ... | 3460 | 50602 | After a horrific alchemy experiment goes wrong... | ['Park, Romi', 'Kugimiya, Rie', 'Miki, Shinich... | ['11061', '16498', '1482', '38000', '9919', '1... | ['74', '44', '21', '17', '16', '14', '14', '9'... | 2009-04-05 | 2010-07-04 | 2.0 | 2 |
2 | 9253 | Steins;Gate | TV | 24.0 | Finished Airing | ['Frontier Works', 'Media Factory', 'Kadokawa ... | ['Funimation'] | ['White Fox'] | Visual novel | ['Drama', 'Sci-Fi', 'Suspense', 'Psychological... | ... | 2868 | 10054 | Eccentric scientist Rintarou Okabe has a never... | ['Miyano, Mamoru', 'Imai, Asami', 'Hanazawa, K... | ['31043', '31240', '9756', '10620', '2236', '4... | ['132', '130', '48', '26', '24', '19', '19', '... | 2011-04-06 | 2011-09-14 | 2.0 | 3 |
3 | 28977 | Gintama° | TV | 51.0 | Finished Airing | ['TV Tokyo', 'Aniplex', 'Dentsu'] | ['Funimation', 'Crunchyroll'] | ['Bandai Namco Pictures'] | Manga | ['Action', 'Comedy', 'Sci-Fi', 'Gag Humor', 'H... | ... | 1477 | 8616 | Gintoki, Shinpachi, and Kagura return as the f... | ['Sugita, Tomokazu', 'Kugimiya, Rie', 'Sakaguc... | ['9863', '30276', '33255', '37105', '6347', '3... | ['3', '2', '1', '1', '1', '1', '1', '1', '1', ... | 2015-04-08 | 2016-03-30 | 2.0 | 4 |
4 | 38524 | Shingeki no Kyojin Season 3 Part 2 | TV | 10.0 | Finished Airing | ['Production I.G', 'Dentsu', 'Mainichi Broadca... | ['Funimation'] | ['Wit Studio'] | Manga | ['Action', 'Drama', 'Suspense', 'Gore', 'Milit... | ... | 1308 | 12803 | Seeking to restore humanity's diminishing hope... | ['Kamiya, Hiroshi', 'Kaji, Yuuki', 'Ishikawa, ... | ['28623', '37521', '25781', '2904', '36649', '... | ['1', '1', '1', '1', '1', '1', '1', '1', '1', ... | 2019-04-29 | 2019-07-01 | 2.0 | 5 |
5 rows × 40 columns
df_info.columns
Index(['MAL_Id', 'Name', 'Type', 'Episodes', 'Status', 'Producers',
'Licensors', 'Studios', 'Source', 'Genres', 'Duration', 'Rating',
'Score', 'Popularity', 'Members', 'Favorites', 'Watching', 'Completed',
'On-Hold', 'Dropped', 'Plan to Watch', 'Total', 'Score-10', 'Score-9',
'Score-8', 'Score-7', 'Score-6', 'Score-5', 'Score-4', 'Score-3',
'Score-2', 'Score-1', 'Synopsis', 'Voice_Actors', 'Recommended_Ids',
'Recommended_Counts', 'Aired_Start', 'Aired_End', 'Premiered_Season',
'Rank'],
dtype='object')
If we want to create a model that provides recommendations based on the contents of a title, features such as 'Recommended_Ids', 'Recommended_Counts', 'Aired_Start', 'Aired_End', 'Premiered_Season', 'Rank', and the granular score/interaction-count features can be excluded from modeling, as they describe reception and scheduling rather than the content itself.
We will retain the 'Score' and 'Popularity' metrics to serve as aggregate attributes of each title, since they summarize overall reception without providing any granular information on individual user preferences.
Hybrid Data From User Reviews
df_review = pd.read_csv('cleaned_anime_reviews.csv')
df_review.head()
review_id | MAL_Id | Review | Tags | |
---|---|---|---|---|
0 | 0 | 52991 | With lives so short, why do we even bother? To... | Recommended |
1 | 0 | 52991 | With lives so short, why do we even bother? To... | Preliminary |
2 | 1 | 52991 | Frieren is the most overrated anime of this de... | Not-Recommended |
3 | 1 | 52991 | Frieren is the most overrated anime of this de... | Funny |
4 | 1 | 52991 | Frieren is the most overrated anime of this de... | Preliminary |
df_review.Tags.value_counts()
Tags
Recommended 48344
Mixed-Feelings 15160
Not-Recommended 14413
Preliminary 13187
Funny 846
Well-written 250
Informative 130
Creative 2
Name: count, dtype: int64
# Only retain Recommended, Mixed-Feelings, Not-Recommended, for collaborative data
# Entire Review data will be vectorized to provide content and collaborative data
One approach to the review dataset would be to process the text of each review to extract additional, user-provided content information about the titles, while the tags attached to each review could be processed into user sentiment.
However, we will skip the reviews dataset in this notebook to focus on building the recommendation systems from the other two datasets. Incorporating the review data is left for future exploration.
Collaborative Data From User Ratings
df_ratings = pd.read_csv('cleaned_user_ratings.csv')
df_ratings.head()
Username | User_Id | Anime_Id | Anime_Title | Rating_Status | Rating_Score | Num_Epi_Watched | Is_Rewatching | Updated | Start_Date | |
---|---|---|---|---|---|---|---|---|---|---|
0 | flerbz | 0 | 30654 | Ansatsu Kyoushitsu 2nd Season | watching | 0 | 24 | False | 2022-02-26 22:15:01+00:00 | 2022-01-29 |
1 | flerbz | 0 | 22789 | Barakamon | dropped | 0 | 2 | False | 2023-01-28 19:03:33+00:00 | 2022-04-06 |
2 | flerbz | 0 | 31964 | Boku no Hero Academia | completed | 0 | 13 | False | 2024-03-31 02:10:32+00:00 | 2024-03-30 |
3 | flerbz | 0 | 33486 | Boku no Hero Academia 2nd Season | completed | 0 | 25 | False | 2024-03-31 22:32:02+00:00 | 2024-03-30 |
4 | flerbz | 0 | 36456 | Boku no Hero Academia 3rd Season | watching | 0 | 24 | False | 2024-04-03 02:08:56+00:00 | 2024-03-31 |
Within this dataset our main features are 'Rating_Score' and the corresponding 'User_Id' and 'Anime_Id', which let us map out each user's preferences and find other users with similar preferences to draw recommendations from.
'Rating_Status' may also be a useful feature, telling us how a user has interacted with a particular title: 'Plan to Watch' suggests the user knows of and is already interested in the title, while 'Completed' tells us the user liked it enough to finish it, making 'Completed' a stronger interaction than 'Plan to Watch'.
Next, we will remove users with too few entries in their lists, as they increase computational load without contributing a comparable amount of information to our model.
tmp = [(df_ratings.value_counts('User_Id')>=i).sum() for i in range(5,305,5)]
tmp = pd.DataFrame({"Cutoff":list(range(5,305,5)), "Users":tmp})
sns.lineplot(tmp, x='Cutoff', y='Users')
<Axes: xlabel='Cutoff', ylabel='Users'>
The number of users decreases roughly linearly with an increasing cutoff (minimum number of titles in a user's list). Since there is no obvious elbow to pick from this curve, we shall arbitrarily set the cutoff to 20 interactions, removing all users with fewer than 20 titles in their lists.
print(f'Number of unique users : {df_ratings.User_Id.nunique()}')
print(f'Number of user interactions : {df_ratings.shape[0]}')
Number of unique users : 17513
Number of user interactions : 5452192
tmp = (df_ratings.value_counts('User_Id') >= 20).reset_index()
tmp = tmp[tmp['count']==True]
df_ratings = df_ratings[df_ratings.User_Id.isin(tmp.User_Id)]
print(f'After removing users with less than 20 interactions')
print(f'Number of unique users : {df_ratings.User_Id.nunique()}')
print(f'Number of user interactions : {df_ratings.shape[0]}')
After removing users with less than 20 interactions
Number of unique users : 16744
Number of user interactions : 5445702
We see that we have 16744 users left, a decrease of about 4.4% from the original 17513 users.
3. Baseline Model & Evaluation
As a baseline we will simply recommend users the most popular titles (based on the popularity metric) that are not already on their ratings lists.
df_info[['MAL_Id','Name','Score','Popularity','Rank']].head()
MAL_Id | Name | Score | Popularity | Rank | |
---|---|---|---|---|---|
0 | 52991 | Sousou no Frieren | 9.276142 | 301 | 1 |
1 | 5114 | Fullmetal Alchemist: Brotherhood | 8.941080 | 3 | 2 |
2 | 9253 | Steins;Gate | 8.962588 | 13 | 3 |
3 | 28977 | Gintama° | 8.726812 | 341 | 4 |
4 | 38524 | Shingeki no Kyojin Season 3 Part 2 | 9.019487 | 21 | 5 |
def mask_user_ratings(user_ratings_df, random_state=42):
    # Sample half of rated titles as input; the remainder becomes validation
    input_df = user_ratings_df[user_ratings_df['Rating_Score']>0].sample(frac=0.5, random_state=random_state)
    val_df = user_ratings_df.drop(input_df.index)
    return input_df, val_df
The above function splits a user's ratings into input and validation splits by sampling half of the user's rated titles into the input split and placing the remaining titles into the validation split.
This approach will allow us to provide a subset of the ground truth (user’s ratings) into our recommendation system as input, and evaluate the recommendations against the remaining subset of the ground truth.
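As a quick illustration, here is a toy run of the split on a hypothetical ratings frame (IDs and scores invented for the example):
# Toy ratings: 4 rated titles and 2 unrated ones (Rating_Score == 0)
toy_ratings = pd.DataFrame({'Anime_Id': [1, 2, 3, 4, 5, 6],
                            'Rating_Score': [7, 0, 8, 5, 0, 9]})
inp, val = mask_user_ratings(toy_ratings)
print(len(inp), len(val))  # 2 rated titles become input, the other 4 rows validation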
class PopularityRec:
    def __init__(self, df, anime_info_df=None):
        self.popularity = df
        self.anime_info = anime_info_df
    def predict(self, user_ratings_df, topn=10, left_on='MAL_Id', right_on='Anime_Id'):
        # Rank titles by popularity (lower value = more popular)
        rec_df = self.popularity.sort_values('Popularity', ascending=True)
        # Left-merge against the user's list; rows with no match (NaN on the right key)
        # are titles the user has not interacted with
        rec_df = rec_df.merge(user_ratings_df, how='left', left_on=left_on, right_on=right_on)
        return rec_df.loc[rec_df[right_on].isna()][self.popularity.columns][:topn]
The above code creates a popularity recommender object that makes predictions based on the most popular titles the input user has not interacted with; this will serve as the baseline model in this notebook.
Evaluation
Two metrics will be used here; a small worked example follows the definitions.
- Mean Reciprocal Rank

\[MRR = \dfrac{1}{N}\sum_{n=1}^{N} \dfrac{1}{rank_n}\]

where N is the total number of users and rank_n is the position of the first relevant item within the recommendations made for the nth user.
- Normalized Discounted Cumulative Gain

\(NDCG@K = \dfrac{DCG@K}{IdealDCG@K}\) \(DCG@K = \sum_{k=1}^K \dfrac{rel_k}{\log_2 (k+1)}\)

where K is the total number of top recommendations we are evaluating, k is the position within those recommendations, and rel_k is the relevance score of the recommendation at position k.
IdealDCG@K is calculated by sorting the recommendations@K from highest relevance to lowest before calculating DCG@K. This returns the maximum achievable DCG for the same set of ranked recommendations@K.
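As a worked example with toy relevance values (not from the dataset), suppose the top 5 recommendations have relevances [0, 3, 0, 1, 0], so the first relevant item sits at rank 2:
# Worked toy example of both metrics for a single user's ranked recommendations
rel = [0, 3, 0, 1, 0]                                    # relevance by rank position
first_hit = next(k for k, r in enumerate(rel) if r > 0)  # index 1 -> rank 2
print(1 / (first_hit + 1))                               # reciprocal rank = 0.5
dcg = sum(r / math.log(k + 2, 2) for k, r in enumerate(rel))
idcg = sum(r / math.log(k + 2, 2) for k, r in enumerate(sorted(rel, reverse=True)))
print(dcg / idcg)                                        # NDCG@5 ~= 0.64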
Essentially, the evaluation process in this notebook is:
- Each user has part of their ratings list randomly sampled to use as input data
- The recommendation system takes in the input data and returns a set of ranked recommendations
- The remaining part of the user's ratings list is used as validation data
- MRR/NDCG are calculated from the ranked recommendations and this validation portion of the user's ratings list
class ModelEvaluator:
    def evaluate_mrr(self, ranked_rec_df, user_input_df, user_val_df, weight=1, topn=10, left_on='MAL_Id', right_on='Anime_Id'):
        scoring_df = ranked_rec_df.merge(user_val_df, how='left', left_on=left_on, right_on=right_on)
        scoring_df = scoring_df.loc[~scoring_df[right_on].isna()][:topn]
        matched_idx = list(scoring_df[scoring_df[right_on].isin(user_val_df[right_on])].index)
        if not matched_idx:
            return 0
        # Reciprocal rank of the first relevant recommendation
        return (1 * weight) / (matched_idx[0] + 1)
    def evaluate_ndcg(self, ranked_rec_df, user_input_df, user_val_df, weight=1, topn=10, left_on='MAL_Id', right_on='Anime_Id'):
        scoring_df = ranked_rec_df.merge(user_val_df, how='left', left_on=left_on, right_on=right_on)
        scoring_df = scoring_df.iloc[:topn]
        # Calculate relevance score based on how well the user interaction went
        scoring_df['rel'] = 0.0
        scoring_df.loc[scoring_df.Rating_Score == 0, 'rel'] = 0.5
        scoring_df.loc[scoring_df.Rating_Score > 0, 'rel'] = 1
        scoring_df.loc[scoring_df.Rating_Score > 5, 'rel'] = 2
        scoring_df.loc[scoring_df.Rating_Score > 8, 'rel'] = 3
        cg = list(scoring_df['rel'])
        if not cg or max(cg) == 0:
            return 0
        icg = sorted(cg, reverse=True)
        # Discount each relevance by log2(position + 1)
        cg = list(np.array(cg) / np.array([math.log(i+1, 2) for i in range(1, len(cg) + 1)]))
        icg = list(np.array(icg) / np.array([math.log(i+1, 2) for i in range(1, len(icg) + 1)]))
        ndcg = sum(cg) / sum(icg)
        return ndcg
test_input_df, test_val_df = mask_user_ratings(df_ratings[df_ratings['User_Id']==47], random_state=1)
tester_rec = PopularityRec(df_info, df_info)
pred_df = tester_rec.predict(test_input_df, topn=10)
pred_df.head()
MAL_Id | Name | Type | Episodes | Status | Producers | Licensors | Studios | Source | Genres | ... | Score-2 | Score-1 | Synopsis | Voice_Actors | Recommended_Ids | Recommended_Counts | Aired_Start | Aired_End | Premiered_Season | Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16498 | Shingeki no Kyojin | TV | 25.0 | Finished Airing | ['Production I.G', 'Dentsu', 'Mainichi Broadca... | ['Funimation'] | ['Wit Studio'] | Manga | ['Action', 'Award Winning', 'Drama', 'Suspense... | ... | 3828 | 9049 | Centuries ago, mankind was slaughtered to near... | ['Kaji, Yuuki', 'Ishikawa, Yui', 'Inoue, Marin... | ['28623', '37779', '26243', '20787', '5114', '... | ['111', '49', '49', '45', '44', '42', '36', '3... | 2013-04-07 | 2013-09-29 | 2.0 | 109 |
1 | 1535 | Death Note | TV | 37.0 | Finished Airing | ['VAP', 'Nippon Television Network', 'Shueisha... | ['VIZ Media'] | ['Madhouse'] | Manga | ['Supernatural', 'Suspense', 'Psychological', ... | ... | 3238 | 5382 | Brutal murders, petty thefts, and senseless vi... | ['Yamaguchi, Kappei', 'Miyano, Mamoru', 'Nakam... | ['1575', '19', '23283', '10620', '13601', '290... | ['633', '113', '95', '74', '67', '52', '50', '... | 2006-10-04 | 2007-06-27 | 4.0 | 79 |
2 | 5114 | Fullmetal Alchemist: Brotherhood | TV | 64.0 | Finished Airing | ['Aniplex', 'Square Enix', 'Mainichi Broadcast... | ['Funimation', 'Aniplex of America'] | ['Bones'] | Manga | ['Action', 'Adventure', 'Drama', 'Fantasy', 'M... | ... | 3460 | 50602 | After a horrific alchemy experiment goes wrong... | ['Park, Romi', 'Kugimiya, Rie', 'Miki, Shinich... | ['11061', '16498', '1482', '38000', '9919', '1... | ['74', '44', '21', '17', '16', '14', '14', '9'... | 2009-04-05 | 2010-07-04 | 2.0 | 2 |
3 | 30276 | One Punch Man | TV | 12.0 | Finished Airing | ['TV Tokyo', 'Bandai Visual', 'Lantis', 'Asats... | ['VIZ Media'] | ['Madhouse'] | Web manga | ['Action', 'Comedy', 'Adult Cast', 'Parody', '... | ... | 2027 | 3701 | The seemingly unimpressive Saitama has a rathe... | ['Furukawa, Makoto', 'Ishikawa, Kaito', 'Yuuki... | ['32182', '31964', '33255', '29803', '918', '5... | ['163', '94', '26', '21', '16', '16', '11', '1... | 2015-10-05 | 2015-12-21 | 4.0 | 129 |
5 | 38000 | Kimetsu no Yaiba | TV | 26.0 | Finished Airing | ['Aniplex', 'Shueisha'] | ['Aniplex of America'] | ['ufotable'] | Manga | ['Action', 'Award Winning', 'Fantasy', 'Histor... | ... | 2354 | 6186 | Ever since the death of his father, the burden... | ['Hanae, Natsuki', 'Shimono, Hiro', 'Kitou, Ak... | ['40748', '37520', '16498', '269', '5114', '31... | ['70', '42', '20', '20', '17', '15', '12', '11... | 2019-04-06 | 2019-09-28 | 2.0 | 143 |
5 rows × 40 columns
As a sanity check we have made some recommendations using User Id 47’s ratings list as shown above.
tester_eval = ModelEvaluator()
print('MRR : ', tester_eval.evaluate_mrr(pred_df, test_input_df, test_val_df, topn=10))
print('NDCG : ', tester_eval.evaluate_ndcg(pred_df, test_input_df, test_val_df, topn=10))
MRR : 0.2
NDCG : 0.430624116386567
Evaluating these predictions yields the scores shown above.
Next we shall make and evaluate recommendations for all the users in our dataset.
# Calculate baseline performance
count = 0
total_mrr = 0
total_ndcg = 0
mrr_base, ndcg_base = [], []
for i in df_ratings.User_Id.unique():
    count += 1
    test_input_df, test_val_df = mask_user_ratings(df_ratings[df_ratings['User_Id']==i], random_state=1)
    pred_df = tester_rec.predict(test_input_df, topn=10)
    mrr = tester_eval.evaluate_mrr(pred_df, test_input_df, test_val_df, topn=10)
    ndcg = tester_eval.evaluate_ndcg(pred_df, test_input_df, test_val_df, topn=10)
    total_mrr += mrr
    total_ndcg += ndcg
    mrr_base.append(mrr)
    ndcg_base.append(ndcg)
def running_avg(scores):
    avgs = np.cumsum(scores) / np.array(range(1, len(scores) + 1))
    return avgs
print(f'Baseline MRR: {running_avg(mrr_base)[-1]}')
print(f'Baseline NDCG: {running_avg(ndcg_base)[-1]}')
Baseline MRR: 0.657765657823893
Baseline NDCG: 0.6612426800967172
Above we see our baseline performance that we want to beat.
4. Content Based Filtering
In this section we will explore content based filtering, where only information about the titles (i.e. descriptions, content attributes) will be used to recommend similar items to users based on their stated preferences.
As stated earlier in this notebook, we will treat the "Score" and "Popularity" features as content attributes tied to each title, since they do not provide any granular user preference information.
The underlying assumption here is that users who have interacted with certain kinds of titles will likely enjoy other similar titles. A limitation of this is that recommendations may lack diversity: a user who has interacted mostly with Action titles, for example, will mainly be recommended similar Action titles, with little deviation into other kinds of titles.
Title similarities will be calculated using cosine similarity, where two similar titles should point in roughly the same direction within an inner product space.
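Concretely, for two title feature vectors \(A\) and \(B\):
\[similarity(A, B) = \dfrac{A \cdot B}{\|A\| \|B\|}\]
which is 1 when the vectors point in the same direction, 0 when they are orthogonal, and -1 when they are opposed.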
First we will start with some data preprocessing.
df_content = df_info.copy().drop(['Aired_Start','Aired_End','Premiered_Season','Rank','Recommended_Ids','Recommended_Counts','Score-10', 'Score-9',
'Score-8', 'Score-7', 'Score-6', 'Score-5', 'Score-4', 'Score-3',
'Score-2', 'Score-1','Total','Watching','Completed','On-Hold','Dropped','Plan to Watch','Status','Source'], axis=1)
df_content.head()
MAL_Id | Name | Type | Episodes | Producers | Licensors | Studios | Genres | Duration | Rating | Score | Popularity | Members | Favorites | Synopsis | Voice_Actors | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52991 | Sousou no Frieren | TV | 28.0 | ['Aniplex', 'Dentsu', 'Shogakukan-Shueisha Pro... | ['None found', 'add some'] | ['Madhouse'] | ['Adventure', 'Drama', 'Fantasy', 'Shounen'] | 24 min. per ep. | PG-13 - Teens 13 or older | 9.276142 | 301 | 670859 | 35435 | During their decade-long quest to defeat the D... | ['Tanezaki, Atsumi', 'Ichinose, Kana', 'Kobaya... |
1 | 5114 | Fullmetal Alchemist: Brotherhood | TV | 64.0 | ['Aniplex', 'Square Enix', 'Mainichi Broadcast... | ['Funimation', 'Aniplex of America'] | ['Bones'] | ['Action', 'Adventure', 'Drama', 'Fantasy', 'M... | 24 min. per ep. | R - 17+ (violence & profanity) | 8.941080 | 3 | 3331144 | 225215 | After a horrific alchemy experiment goes wrong... | ['Park, Romi', 'Kugimiya, Rie', 'Miki, Shinich... |
2 | 9253 | Steins;Gate | TV | 24.0 | ['Frontier Works', 'Media Factory', 'Kadokawa ... | ['Funimation'] | ['White Fox'] | ['Drama', 'Sci-Fi', 'Suspense', 'Psychological... | 24 min. per ep. | PG-13 - Teens 13 or older | 8.962588 | 13 | 2553356 | 189031 | Eccentric scientist Rintarou Okabe has a never... | ['Miyano, Mamoru', 'Imai, Asami', 'Hanazawa, K... |
3 | 28977 | Gintama° | TV | 51.0 | ['TV Tokyo', 'Aniplex', 'Dentsu'] | ['Funimation', 'Crunchyroll'] | ['Bandai Namco Pictures'] | ['Action', 'Comedy', 'Sci-Fi', 'Gag Humor', 'H... | 24 min. per ep. | PG-13 - Teens 13 or older | 8.726812 | 341 | 628071 | 16610 | Gintoki, Shinpachi, and Kagura return as the f... | ['Sugita, Tomokazu', 'Kugimiya, Rie', 'Sakaguc... |
4 | 38524 | Shingeki no Kyojin Season 3 Part 2 | TV | 10.0 | ['Production I.G', 'Dentsu', 'Mainichi Broadca... | ['Funimation'] | ['Wit Studio'] | ['Action', 'Drama', 'Suspense', 'Gore', 'Milit... | 23 min. per ep. | R - 17+ (violence & profanity) | 9.019487 | 21 | 2262916 | 58383 | Seeking to restore humanity's diminishing hope... | ['Kamiya, Hiroshi', 'Kaji, Yuuki', 'Ishikawa, ... |
# Convert duration column to number of minutes
def convert_duration(duration):
    duration = duration.split(' ')
    duration_mins = 0
    curr_min = 1/60  # default multiplier, so trailing seconds are converted to minutes
    # Walk the tokens right to left, updating the multiplier whenever a unit is seen
    for char in duration[::-1]:
        if 'min' in char:
            curr_min = 1
        elif 'hr' in char:
            curr_min = 60
        elif char.isnumeric():
            duration_mins += int(char) * curr_min
    return duration_mins
df_content.Duration = df_content.Duration.apply(convert_duration)
df_content.Duration.head()
0 24.0
1 24.0
2 24.0
3 24.0
4 23.0
Name: Duration, dtype: float64
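A few spot-checks of convert_duration on the duration formats that appear in the data (the 'sec.' case is an assumed edge case, covered by the 1/60 default):
print(convert_duration('24 min. per ep.'))  # 24
print(convert_duration('1 hr. 50 min.'))    # 110
print(convert_duration('50 sec.'))          # ~0.83, seconds fall back to the 1/60 multiplier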
# Onehotencode Genre
genres = df_content['Genres'].apply(literal_eval).explode()
genres = 'genre_' + genres
genres = genres.fillna('genre_na')
df_content = df_content.drop('Genres', axis = 1).join(pd.crosstab(genres.index, genres))
df_content.head()
MAL_Id | Name | Type | Episodes | Producers | Licensors | Studios | Duration | Rating | Score | ... | genre_Supernatural | genre_Survival | genre_Suspense | genre_Team Sports | genre_Time Travel | genre_Vampire | genre_Video Game | genre_Visual Arts | genre_Workplace | genre_na | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52991 | Sousou no Frieren | TV | 28.0 | ['Aniplex', 'Dentsu', 'Shogakukan-Shueisha Pro... | ['None found', 'add some'] | ['Madhouse'] | 24.0 | PG-13 - Teens 13 or older | 9.276142 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 5114 | Fullmetal Alchemist: Brotherhood | TV | 64.0 | ['Aniplex', 'Square Enix', 'Mainichi Broadcast... | ['Funimation', 'Aniplex of America'] | ['Bones'] | 24.0 | R - 17+ (violence & profanity) | 8.941080 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 9253 | Steins;Gate | TV | 24.0 | ['Frontier Works', 'Media Factory', 'Kadokawa ... | ['Funimation'] | ['White Fox'] | 24.0 | PG-13 - Teens 13 or older | 8.962588 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 28977 | Gintama° | TV | 51.0 | ['TV Tokyo', 'Aniplex', 'Dentsu'] | ['Funimation', 'Crunchyroll'] | ['Bandai Namco Pictures'] | 24.0 | PG-13 - Teens 13 or older | 8.726812 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 38524 | Shingeki no Kyojin Season 3 Part 2 | TV | 10.0 | ['Production I.G', 'Dentsu', 'Mainichi Broadca... | ['Funimation'] | ['Wit Studio'] | 23.0 | R - 17+ (violence & profanity) | 9.019487 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 90 columns
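The explode-plus-crosstab pattern above one-hot encodes a list-valued column in two steps; a minimal sketch on toy data:
# Toy demonstration of the explode + crosstab one-hot encoding pattern
toy = pd.DataFrame({'Genres': ["['Action', 'Drama']", "['Drama']"]})
g = 'genre_' + toy['Genres'].apply(literal_eval).explode()
print(pd.crosstab(g.index, g))  # one indicator column per genre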
# Label encode Type, Rating
cols = ['Type','Rating']
for col in cols:
    le = LabelEncoder()
    df_content[col] = le.fit_transform(df_content[col])
df_content[cols].head()
Type | Rating | |
---|---|---|
0 | 4 | 2 |
1 | 4 | 3 |
2 | 4 | 2 |
3 | 4 | 2 |
4 | 4 | 3 |
Above we have encoded the relevant categorical features found in the dataset. Next we need to vectorize the remaining text features.
# Count-vectorize Name, Producers, Licensors, Studios, Voice_Actors
cols = ['Name','Producers','Licensors','Studios','Voice_Actors']
sparse_total = []
for col in cols:
    # Assign the cleaned strings back (fill missing values, strip list brackets)
    df_content[col] = df_content[col].apply(lambda x: '' if pd.isna(x) else x.strip('[]'))
    vec = CountVectorizer()
    sparse_tmp = vec.fit_transform(df_content[col])
    # Stack each column's count matrix side by side into one sparse matrix
    if isinstance(sparse_total, list):
        sparse_total = sparse_tmp
    else:
        sparse_total = hstack((sparse_total, sparse_tmp))
sparse_total
<13300x18768 sparse matrix of type '<class 'numpy.int64'>'
with 332269 stored elements in Compressed Sparse Row format>
# TFIDF vectorize Synopsis to place emphasis on words with fewer occurrences
sw = stopwords.words('english')
tfidf_vec = TfidfVectorizer(analyzer='word',
ngram_range=(1,2),
max_df=0.5,
min_df=0.001,
stop_words=sw)
sparse_tfidf = tfidf_vec.fit_transform(df_content['Synopsis'])
sparse_tfidf
<13300x6754 sparse matrix of type '<class 'numpy.float64'>'
with 488956 stored elements in Compressed Sparse Row format>
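As a quick sanity check (assuming a scikit-learn version that provides get_feature_names_out), we can peek at the heaviest-weighted terms for the first synopsis:
terms = np.array(tfidf_vec.get_feature_names_out())
row = sparse_tfidf[0].toarray().ravel()
print(terms[row.argsort()[::-1][:5]])  # top 5 TF-IDF terms for Sousou no Frieren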
For this approach we will not combine the various arrays into a single dataframe as input. Instead we will leave them as three separate arrays:
- a dense dataframe containing the numerical and categorical columns
- a sparse array containing the count-vectorized features
- a sparse array containing the TF-IDF-vectorized synopsis
This decreases the computational cost of calculating similarities between titles, since the sparse blocks stay sparse instead of being densified into one combined matrix. Recommendations made from each of the three arrays will contribute to a final list of recommendations.
df_dense = df_content.drop(['Name','Producers','Licensors','Studios','Voice_Actors','Synopsis'], axis=1)
df_dense.head()
MAL_Id | Type | Episodes | Duration | Rating | Score | Popularity | Members | Favorites | genre_Action | ... | genre_Supernatural | genre_Survival | genre_Suspense | genre_Team Sports | genre_Time Travel | genre_Vampire | genre_Video Game | genre_Visual Arts | genre_Workplace | genre_na | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52991 | 4 | 28.0 | 24.0 | 2 | 9.276142 | 301 | 670859 | 35435 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 5114 | 4 | 64.0 | 24.0 | 3 | 8.941080 | 3 | 3331144 | 225215 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 9253 | 4 | 24.0 | 24.0 | 2 | 8.962588 | 13 | 2553356 | 189031 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 28977 | 4 | 51.0 | 24.0 | 2 | 8.726812 | 341 | 628071 | 16610 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 38524 | 4 | 10.0 | 23.0 | 3 | 9.019487 | 21 | 2262916 | 58383 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 84 columns
print("Missing Episodes: ", df_dense.Episodes.isna().sum())
df_dense.Episodes = df_dense.Episodes.fillna(0)
print("Missing Episodes after fillna: ", df_dense.Episodes.isna().sum())
Missing Episodes: 55
Missing Episodes after fillna: 0
scale_cols = ['Score','Members','Favorites','Episodes']
ss = StandardScaler()
df_dense[scale_cols] = ss.fit_transform(df_dense[scale_cols])
df_dense.head()
MAL_Id | Type | Episodes | Duration | Rating | Score | Popularity | Members | Favorites | genre_Action | ... | genre_Supernatural | genre_Survival | genre_Suspense | genre_Team Sports | genre_Time Travel | genre_Vampire | genre_Video Game | genre_Visual Arts | genre_Workplace | genre_na | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52991 | 4 | 0.271027 | 24.0 | 2 | 2.745298 | 301 | 2.705886 | 5.556849 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 5114 | 4 | 0.952556 | 24.0 | 3 | 2.419119 | 3 | 14.743216 | 36.053453 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 9253 | 4 | 0.195302 | 24.0 | 2 | 2.440056 | 13 | 11.223860 | 30.238883 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 28977 | 4 | 0.706449 | 24.0 | 2 | 2.210530 | 341 | 2.512278 | 2.531775 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 38524 | 4 | -0.069737 | 23.0 | 3 | 2.495447 | 21 | 9.909669 | 9.244467 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 84 columns
class ContentBasedRecommender:
    def __init__(self, df_content):
        self.df_content = df_content
        self.df_dense, self.sparse_vec, self.sparse_tfidf = self.process_df(self.df_content)
        # Rank-based weights: position i in the argsorted similarity order receives
        # 1/log10(N - i + 1) + 1, so more similar titles get larger weights
        self.ref_weights = [1/math.log(len(self.df_content)-i+1, 10) + 1 for i in range(len(self.df_content))]
    def process_df(self, df_content):
        # One-hot encode genres
        genres = df_content['Genres'].apply(literal_eval).explode()
        genres = 'genre_' + genres
        genres = genres.fillna('genre_na')
        df_content = df_content.drop('Genres', axis=1).join(pd.crosstab(genres.index, genres))
        # Label encode
        for col in ['Type','Rating']:
            le = LabelEncoder()
            df_content[col] = le.fit_transform(df_content[col])
        # Count-vectorize the list-like text columns
        sparse_vec = []
        for col in ['Name','Producers','Licensors','Studios','Voice_Actors']:
            df_content[col] = df_content[col].apply(lambda x: '' if pd.isna(x) else x.strip('[]'))
            vec = CountVectorizer()
            sparse_tmp = vec.fit_transform(df_content[col])
            if isinstance(sparse_vec, list):
                sparse_vec = sparse_tmp
            else:
                sparse_vec = hstack((sparse_vec, sparse_tmp))
        # TF-IDF vectorize the synopsis
        tfidf_vec = TfidfVectorizer(analyzer='word',
                                    ngram_range=(1,2),
                                    max_df=0.5,
                                    min_df=0.001,
                                    stop_words=sw)
        sparse_tfidf = tfidf_vec.fit_transform(df_content['Synopsis'])
        # Dense numerical/categorical block, scaled
        df_dense = df_content.drop(['Name','Producers','Licensors','Studios','Voice_Actors','Synopsis'], axis=1)
        df_dense.Episodes = df_dense.Episodes.fillna(0)
        scale_cols = ['Score','Members','Favorites','Episodes']
        ss = StandardScaler()
        df_dense[scale_cols] = ss.fit_transform(df_dense[scale_cols])
        return df_dense, sparse_vec, sparse_tfidf
    def get_entry(self, MAL_Id):
        title_dense = self.df_dense[self.df_dense['MAL_Id'] == MAL_Id]
        idx = title_dense.index[0]
        title_vec = self.sparse_vec[idx]
        title_tfidf = self.sparse_tfidf[idx]
        return title_dense, title_vec, title_tfidf
    def calc_sim(self, MAL_Id):
        try:
            title_dense, title_vec, title_tfidf = self.get_entry(MAL_Id)
        except IndexError:
            # Title not present in the content dataset
            return None
        sim_dense = cosine_similarity(title_dense, self.df_dense)
        sim_vec = cosine_similarity(title_vec, self.sparse_vec)
        sim_tfidf = cosine_similarity(title_tfidf, self.sparse_tfidf)
        # argsort leaves the most similar titles at the end of the array
        total = (sim_dense + sim_vec + sim_tfidf).argsort().flatten()
        return total
    def predict_weights(self, user_list):
        weights_df = pd.DataFrame({'Preds': self.df_content.MAL_Id, 'Weights': 0})
        for MAL_Id in user_list:
            recs = self.calc_sim(MAL_Id)
            if recs is None:
                continue
            # Pair each title index with a weight that grows with similarity rank,
            # then re-sort by index so the weights align with the original row order
            weights_zip = sorted(zip(list(recs), self.ref_weights))
            weights_zip = list(zip(*weights_zip))
            weights_df['Weights'] += weights_zip[1]
        # Min-max normalize the accumulated weights
        weights_df['Weights'] = (weights_df['Weights'] - weights_df['Weights'].min()) / (weights_df['Weights'].max() - weights_df['Weights'].min())
        return weights_df
    def par_weights(self, user_list):
        # Same as predict_weights, but computes the similarities in parallel
        weights_df = pd.DataFrame({'Preds': self.df_content.MAL_Id, 'Weights': 0})
        recs_list = []
        with Pool() as pool:
            for recs in pool.imap(self.calc_sim, user_list):
                if recs is None:
                    continue
                recs_list.append(recs)
        for recs in recs_list:
            weights_zip = sorted(zip(list(recs), self.ref_weights))
            weights_zip = list(zip(*weights_zip))
            weights_df['Weights'] += weights_zip[1]
        weights_df['Weights'] = (weights_df['Weights'] - weights_df['Weights'].min()) / (weights_df['Weights'].max() - weights_df['Weights'].min())
        return weights_df
    def par_predict(self, user_df, topn=10):
        user_list = list(user_df['Anime_Id'])
        weights_df = self.par_weights(user_list)
        res = weights_df.merge(self.df_content, how='left', left_on='Preds', right_on='MAL_Id')
        res = res.sort_values('Weights', ascending=False).loc[~res['MAL_Id'].isin(user_list)][:topn]
        return res
    def predict(self, user_df, topn=10):
        user_list = list(user_df['Anime_Id'])
        weights_df = self.predict_weights(user_list)
        res = weights_df.merge(self.df_content, how='left', left_on='Preds', right_on='MAL_Id')
        res = res.sort_values('Weights', ascending=False).loc[~res['MAL_Id'].isin(user_list)][:topn]
        return res
The above code creates a content-based recommender object that processes the input dataset and makes recommendations. During experimentation the cosine similarity calculations proved expensive and slow, so the class includes functions that perform the calculations in parallel to speed things up.
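To illustrate the rank-weight scheme used by predict_weights, here it is on a toy catalogue of N = 5 titles (the real N is the full title count):
# Weights grow with similarity rank, so the titles argsort places last
# (the most similar ones) receive the largest weights
N = 5
toy_weights = [1/math.log(N - i + 1, 10) + 1 for i in range(N)]
print([round(w, 3) for w in toy_weights])  # [2.285, 2.431, 2.661, 3.096, 4.322]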
df_content = df_info.copy().drop(['Aired_Start','Aired_End','Premiered_Season','Rank','Recommended_Ids','Recommended_Counts','Score-10', 'Score-9',
'Score-8', 'Score-7', 'Score-6', 'Score-5', 'Score-4', 'Score-3',
'Score-2', 'Score-1','Total','Watching','Completed','On-Hold','Dropped','Plan to Watch','Status','Source'], axis=1)
df_content.Duration = df_content.Duration.apply(convert_duration)
df_content.Duration.head()
df_content.head()
MAL_Id | Name | Type | Episodes | Producers | Licensors | Studios | Genres | Duration | Rating | Score | Popularity | Members | Favorites | Synopsis | Voice_Actors | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52991 | Sousou no Frieren | TV | 28.0 | ['Aniplex', 'Dentsu', 'Shogakukan-Shueisha Pro... | ['None found', 'add some'] | ['Madhouse'] | ['Adventure', 'Drama', 'Fantasy', 'Shounen'] | 24.0 | PG-13 - Teens 13 or older | 9.276142 | 301 | 670859 | 35435 | During their decade-long quest to defeat the D... | ['Tanezaki, Atsumi', 'Ichinose, Kana', 'Kobaya... |
1 | 5114 | Fullmetal Alchemist: Brotherhood | TV | 64.0 | ['Aniplex', 'Square Enix', 'Mainichi Broadcast... | ['Funimation', 'Aniplex of America'] | ['Bones'] | ['Action', 'Adventure', 'Drama', 'Fantasy', 'M... | 24.0 | R - 17+ (violence & profanity) | 8.941080 | 3 | 3331144 | 225215 | After a horrific alchemy experiment goes wrong... | ['Park, Romi', 'Kugimiya, Rie', 'Miki, Shinich... |
2 | 9253 | Steins;Gate | TV | 24.0 | ['Frontier Works', 'Media Factory', 'Kadokawa ... | ['Funimation'] | ['White Fox'] | ['Drama', 'Sci-Fi', 'Suspense', 'Psychological... | 24.0 | PG-13 - Teens 13 or older | 8.962588 | 13 | 2553356 | 189031 | Eccentric scientist Rintarou Okabe has a never... | ['Miyano, Mamoru', 'Imai, Asami', 'Hanazawa, K... |
3 | 28977 | Gintama° | TV | 51.0 | ['TV Tokyo', 'Aniplex', 'Dentsu'] | ['Funimation', 'Crunchyroll'] | ['Bandai Namco Pictures'] | ['Action', 'Comedy', 'Sci-Fi', 'Gag Humor', 'H... | 24.0 | PG-13 - Teens 13 or older | 8.726812 | 341 | 628071 | 16610 | Gintoki, Shinpachi, and Kagura return as the f... | ['Sugita, Tomokazu', 'Kugimiya, Rie', 'Sakaguc... |
4 | 38524 | Shingeki no Kyojin Season 3 Part 2 | TV | 10.0 | ['Production I.G', 'Dentsu', 'Mainichi Broadca... | ['Funimation'] | ['Wit Studio'] | ['Action', 'Drama', 'Suspense', 'Gore', 'Milit... | 23.0 | R - 17+ (violence & profanity) | 9.019487 | 21 | 2262916 | 58383 | Seeking to restore humanity's diminishing hope... | ['Kamiya, Hiroshi', 'Kaji, Yuuki', 'Ishikawa, ... |
# Initialise our content based recommender object and evaluator object
content_rec = ContentBasedRecommender(df_content)
tester_eval = ModelEvaluator()
# No multiprocessing
s = datetime.now()
test_input_df, test_val_df = mask_user_ratings(df_ratings[df_ratings['User_Id']==47], random_state=1)
pred_df = content_rec.predict(test_input_df, 10)
print("Final MRR: " , tester_eval.evaluate_mrr(pred_df, test_input_df, test_val_df, topn=10))
print("Final NDCG: " , tester_eval.evaluate_ndcg(pred_df, test_input_df, test_val_df, topn=10))
print(datetime.now()-s)
Final MRR: 1.0
Final NDCG: 0.8332242176357783
0:00:01.134000
pred_df.head()
Preds | Weights | MAL_Id | Name | Type | Episodes | Producers | Licensors | Studios | Genres | Duration | Rating | Score | Popularity | Members | Favorites | Synopsis | Voice_Actors | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
117 | 38474 | 0.771254 | 38474 | Yuru Camp△ Season 2 | TV | 13.0 | ['Half H.P Studio', 'MAGES.', 'DeNA'] | ['None found', 'add some'] | ['C-Station'] | ['Slice of Life', 'CGDCT', 'Iyashikei'] | 23.0 | PG-13 - Teens 13 or older | 8.504338 | 1079 | 222123 | 3039 | Having spent Christmas camping with her new fr... | ['Touyama, Nao', 'Hanamori, Yumiri', 'Toyosaki... |
1593 | 54005 | 0.706282 | 54005 | COLORs | ONA | 1.0 | ['TOHO animation'] | ['None found', 'add some'] | ['Wit Studio'] | ['Drama', 'Crossdressing'] | 3.0 | PG-13 - Teens 13 or older | 7.616992 | 7294 | 7142 | 50 | A girl finds herself mesmerized by a young wom... | [] |
1651 | 37341 | 0.691259 | 37341 | Yuru Camp△ Specials | Special | 3.0 | ['None found', 'add some'] | ['None found', 'add some'] | ['C-Station'] | ['Slice of Life', 'CGDCT', 'Iyashikei'] | 8.0 | PG-13 - Teens 13 or older | 7.581927 | 2934 | 55632 | 90 | When Chiaki Oogaki and Aoi Inuyama start the O... | ['Touyama, Nao', 'Hanamori, Yumiri', 'Toyosaki... |
1904 | 51958 | 0.637903 | 51958 | Kono Subarashii Sekai ni Bakuen wo! | TV | 12.0 | ['Half H.P Studio', 'Nippon Columbia', 'Atelie... | ['None found', 'add some'] | ['Drive'] | ['Comedy', 'Fantasy'] | 23.0 | PG-13 - Teens 13 or older | 7.461288 | 768 | 309112 | 1725 | Megumin is a young and passionate wizard from ... | ['Takahashi, Rie', 'Toyosaki, Aki', 'Fukushima... |
316 | 53888 | 0.580874 | 53888 | Spy x Family Movie: Code: White | Movie | 1.0 | ['TOHO animation', 'Shueisha'] | ['None found', 'add some'] | ['Wit Studio', 'CloverWorks'] | ['Action', 'Comedy', 'Childcare', 'Shounen'] | 110.0 | PG-13 - Teens 13 or older | 8.358056 | 2046 | 101970 | 335 | After receiving an order to be replaced in Ope... | ['Tanezaki, Atsumi', 'Hayami, Saori', 'Eguchi,... |
# Multiprocessing
s = datetime.now()
test_input_df, test_val_df = mask_user_ratings(df_ratings[df_ratings['User_Id']==47], random_state=1)
pred_df = content_rec.par_predict(test_input_df, 10)
print("Final MRR: " , tester_eval.evaluate_mrr(pred_df, test_input_df, test_val_df, topn=10))
print("Final NDCG: " , tester_eval.evaluate_ndcg(pred_df, test_input_df, test_val_df, topn=10))
print(datetime.now()-s)
Final MRR: 1.0
Final NDCG: 0.8332242176357783
0:00:00.720000
pred_df.head()
Preds | Weights | MAL_Id | Name | Type | Episodes | Producers | Licensors | Studios | Genres | Duration | Rating | Score | Popularity | Members | Favorites | Synopsis | Voice_Actors | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
117 | 38474 | 0.771254 | 38474 | Yuru Camp△ Season 2 | TV | 13.0 | ['Half H.P Studio', 'MAGES.', 'DeNA'] | ['None found', 'add some'] | ['C-Station'] | ['Slice of Life', 'CGDCT', 'Iyashikei'] | 23.0 | PG-13 - Teens 13 or older | 8.504338 | 1079 | 222123 | 3039 | Having spent Christmas camping with her new fr... | ['Touyama, Nao', 'Hanamori, Yumiri', 'Toyosaki... |
1593 | 54005 | 0.706282 | 54005 | COLORs | ONA | 1.0 | ['TOHO animation'] | ['None found', 'add some'] | ['Wit Studio'] | ['Drama', 'Crossdressing'] | 3.0 | PG-13 - Teens 13 or older | 7.616992 | 7294 | 7142 | 50 | A girl finds herself mesmerized by a young wom... | [] |
1651 | 37341 | 0.691259 | 37341 | Yuru Camp△ Specials | Special | 3.0 | ['None found', 'add some'] | ['None found', 'add some'] | ['C-Station'] | ['Slice of Life', 'CGDCT', 'Iyashikei'] | 8.0 | PG-13 - Teens 13 or older | 7.581927 | 2934 | 55632 | 90 | When Chiaki Oogaki and Aoi Inuyama start the O... | ['Touyama, Nao', 'Hanamori, Yumiri', 'Toyosaki... |
1904 | 51958 | 0.637903 | 51958 | Kono Subarashii Sekai ni Bakuen wo! | TV | 12.0 | ['Half H.P Studio', 'Nippon Columbia', 'Atelie... | ['None found', 'add some'] | ['Drive'] | ['Comedy', 'Fantasy'] | 23.0 | PG-13 - Teens 13 or older | 7.461288 | 768 | 309112 | 1725 | Megumin is a young and passionate wizard from ... | ['Takahashi, Rie', 'Toyosaki, Aki', 'Fukushima... |
316 | 53888 | 0.580874 | 53888 | Spy x Family Movie: Code: White | Movie | 1.0 | ['TOHO animation', 'Shueisha'] | ['None found', 'add some'] | ['Wit Studio', 'CloverWorks'] | ['Action', 'Comedy', 'Childcare', 'Shounen'] | 110.0 | PG-13 - Teens 13 or older | 8.358056 | 2046 | 101970 | 335 | After receiving an order to be replaced in Ope... | ['Tanezaki, Atsumi', 'Hayami, Saori', 'Eguchi,... |
A sanity check on the same User Id 47 shows that the parallel and non-parallel functions produce identical results, with the parallel version cutting computation time by roughly a third here.
count, total_mrr, total_ndcg = 0, 0, 0
s = datetime.now()
number_of_samples = 1000
mrr_content, ndcg_content = [], []
for i in np.random.choice(df_ratings.User_Id.unique(), number_of_samples, replace=False):
    s_inner = datetime.now()
    count += 1
    test_input_df, test_val_df = mask_user_ratings(df_ratings[df_ratings['User_Id']==i], random_state=1)
    pred_df = content_rec.par_predict(test_input_df, 10)
    mrr = tester_eval.evaluate_mrr(pred_df, test_input_df, test_val_df, topn=10)
    ndcg = tester_eval.evaluate_ndcg(pred_df, test_input_df, test_val_df, topn=10)
    total_mrr += mrr
    total_ndcg += ndcg
    mrr_content.append(mrr)
    ndcg_content.append(ndcg)
As the computational time required is significantly higher than for our base model, we will evaluate our subsequent models on a random subset of 1000 of the 16744 users in our dataset, comfortably above the roughly 380 users needed for a 95% confidence level at a 5% margin of error.
print(f'Content MRR: {running_avg(mrr_content)[-1]}')
print(f'Content NDCG: {running_avg(ndcg_content)[-1]}')
Content MRR: 0.6794964285714287
Content NDCG: 0.6826560139876724
We see that our content based recommendation system barely beats our baseline model.
5. Collaborative Filtering
Within this section we will utilise the preferences and ratings of many users to predict what other, similar users may be interested in.
The underlying assumption here is that users with similar preferences and opinions will prefer the same titles as one another.
To make recommendations, the user's input data is appended to our ratings data and singular value decomposition (SVD) is applied to factorize the matrix. A dot product of the feature vector corresponding to the input user with the feature vectors corresponding to the titles then returns similarity measures that we can use to make recommendations.
The same process can be applied to inputs with multiple users, which is useful for making recommendations to groups of people looking for new titles to watch together.
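As a minimal sketch of this scoring step, consider a toy 3-user by 4-title interaction matrix (values invented, on the 0-4 interaction scale defined below):
toy = csr_matrix(np.array([[4., 0., 3., 0.],
                           [4., 4., 0., 0.],
                           [0., 4., 3., 4.]]))
U, sigma, Vt = svds(toy, k=2)      # keep the 2 largest singular values
scores = U @ np.diag(sigma) @ Vt   # reconstructed user-title affinities
print(np.round(scores, 2))         # unseen titles now receive nonzero scores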
df_ratings.head()
Username | User_Id | Anime_Id | Anime_Title | Rating_Status | Rating_Score | Num_Epi_Watched | Is_Rewatching | Updated | Start_Date | |
---|---|---|---|---|---|---|---|---|---|---|
0 | flerbz | 0 | 30654 | Ansatsu Kyoushitsu 2nd Season | watching | 0 | 24 | False | 2022-02-26 22:15:01+00:00 | 2022-01-29 |
1 | flerbz | 0 | 22789 | Barakamon | dropped | 0 | 2 | False | 2023-01-28 19:03:33+00:00 | 2022-04-06 |
2 | flerbz | 0 | 31964 | Boku no Hero Academia | completed | 0 | 13 | False | 2024-03-31 02:10:32+00:00 | 2024-03-30 |
3 | flerbz | 0 | 33486 | Boku no Hero Academia 2nd Season | completed | 0 | 25 | False | 2024-03-31 22:32:02+00:00 | 2024-03-30 |
4 | flerbz | 0 | 36456 | Boku no Hero Academia 3rd Season | watching | 0 | 24 | False | 2024-04-03 02:08:56+00:00 | 2024-03-31 |
def pivot_ratings(df):
    # Mean score over rated titles per user, used to remove each user's rating bias
    mean_df = df[df['Rating_Score']>0].groupby('User_Id')['Rating_Score'].mean().reset_index().rename(columns={'Rating_Score':'Mean_Score'})
    df = df.merge(mean_df)
    # Ordinal interaction score: unrated = 2, below the user's mean = 1,
    # at the mean = 3, above the mean = 4
    rated = df.Rating_Score > 0
    df['Interactions'] = 0.0
    df.loc[df.Rating_Score == 0, 'Interactions'] = 2
    df.loc[rated & (df.Rating_Score < df.Mean_Score), 'Interactions'] = 1
    df.loc[rated & (df.Rating_Score == df.Mean_Score), 'Interactions'] = 3
    df.loc[rated & (df.Rating_Score > df.Mean_Score), 'Interactions'] = 4
    df = df.pivot(index='User_Id', columns='Anime_Id', values='Interactions').fillna(0)
    return df
The above function calculates each user's mean score over their rated titles and compares every rating against that mean, removing per-user rating bias. An interaction score is then assigned based on how well the interaction went, and a pivot turns the result into a user-by-title matrix ready for modeling.
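A toy run of the scoring for a single hypothetical user: the mean over rated titles is (9+7+5)/3 = 7, so ratings of 9, 7, and 5 map to 4, 3, and 1 respectively, and the unrated title maps to 2.
toy = pd.DataFrame({'User_Id': [1, 1, 1, 1],
                    'Anime_Id': [10, 20, 30, 40],
                    'Rating_Score': [9, 7, 5, 0]})
print(pivot_ratings(toy))  # interactions: 4, 3, 1, 2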
df_cf = df_ratings.copy()
df_cf = pivot_ratings(df_cf)
df_cf.head()
Anime_Id | 1 | 5 | 6 | 7 | 8 | 15 | 16 | 17 | 18 | 19 | ... | 58564 | 58567 | 58569 | 58572 | 58573 | 58592 | 58600 | 58603 | 58614 | 58632 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
User_Id | |||||||||||||||||||||
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 3.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 17178 columns
df_cf.shape
(15615, 17178)
test_input_df_original, test_val_df = mask_user_ratings(df_ratings[df_ratings['User_Id']==47], random_state=1)
test_input_df = pivot_ratings(test_input_df_original)
test_input_df
Anime_Id | 1 | 5 | 32 | 4037 | 11757 | 14719 | 30831 | 31933 | 34798 | 38040 | ... | 49026 | 50265 | 50602 | 50710 | 51179 | 52701 | 52741 | 53887 | 54829 | 55818 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
User_Id | |||||||||||||||||||||
47 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | ... | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
1 rows × 21 columns
# calculating new index labels for the test input
new_index = pd.Series(list(range(df_cf.index[-1] + 1, df_cf.index[-1] + 1 + len(test_input_df))))
new_index
0 20011
dtype: int64
# Ratings dataset + user input data
df_cf = pd.concat([df_cf, test_input_df.set_index(new_index)]).fillna(0)
df_cf
Anime_Id | 1 | 5 | 6 | 7 | 8 | 15 | 16 | 17 | 18 | 19 | ... | 58564 | 58567 | 58569 | 58572 | 58573 | 58592 | 58600 | 58603 | 58614 | 58632 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 3.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20007 | 4.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
20008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
20009 | 4.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
20010 | 4.0 | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
20011 | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
15616 rows × 17178 columns
# Applying SVD on sparse matrix
s = datetime.now()
sparse_cf = csr_matrix(df_cf)
U, sigma, Vt = svds(sparse_cf)  # scipy's svds computes k=6 singular values by default
print(datetime.now() - s)
0:00:04.596998
U.shape
(15616, 6)
Vt.shape
(6, 17178)
sigma = np.diag(sigma)  # promote the singular-value vector to a diagonal matrix
# Reconstruct the matrix and min-max normalize the measures
all_ratings = np.dot(np.dot(U, sigma), Vt)
all_ratings = (all_ratings - all_ratings.min()) / (all_ratings.max() - all_ratings.min())
all_ratings.shape
(15616, 17178)
# Show our reconstructed matrix, for each row of user_id the columns show how closely aligned they are with each anime title
df_cf_pred = pd.DataFrame(all_ratings, columns = df_cf.columns, index = df_cf.index)
df_cf_pred
Anime_Id | 1 | 5 | 6 | 7 | 8 | 15 | 16 | 17 | 18 | 19 | ... | 58564 | 58567 | 58569 | 58572 | 58573 | 58592 | 58600 | 58603 | 58614 | 58632 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.322143 | 0.281029 | 0.284811 | 0.265056 | 0.264715 | 0.270996 | 0.261508 | 0.264742 | 0.271192 | 0.309478 | ... | 0.264124 | 0.276174 | 0.264103 | 0.270113 | 0.268170 | 0.265395 | 0.264182 | 0.264130 | 0.264266 | 0.264123 |
1 | 0.426405 | 0.302308 | 0.289836 | 0.264833 | 0.262356 | 0.255552 | 0.300901 | 0.262838 | 0.282760 | 0.416440 | ... | 0.264126 | 0.313230 | 0.264115 | 0.285468 | 0.267381 | 0.268286 | 0.267278 | 0.264312 | 0.264614 | 0.264151 |
2 | 0.355228 | 0.353592 | 0.215732 | 0.261275 | 0.276188 | 0.306569 | 0.335670 | 0.266856 | 0.261291 | 0.128861 | ... | 0.264151 | 0.243649 | 0.264216 | 0.253360 | 0.270926 | 0.263537 | 0.269985 | 0.265600 | 0.264068 | 0.264374 |
4 | 0.335182 | 0.252427 | 0.282612 | 0.265102 | 0.265241 | 0.278636 | 0.267611 | 0.264956 | 0.287778 | 0.306845 | ... | 0.264129 | 0.291591 | 0.264180 | 0.277689 | 0.264350 | 0.267018 | 0.264216 | 0.264182 | 0.264060 | 0.264139 |
5 | 0.196523 | 0.209384 | 0.235906 | 0.261257 | 0.262673 | 0.249039 | 0.244840 | 0.261336 | 0.264960 | 0.233427 | ... | 0.264131 | 0.296764 | 0.264170 | 0.282507 | 0.261342 | 0.268099 | 0.265622 | 0.264338 | 0.264221 | 0.264163 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20007 | 0.489428 | 0.306336 | 0.338105 | 0.269412 | 0.263843 | 0.272239 | 0.307359 | 0.264999 | 0.296388 | 0.462530 | ... | 0.264125 | 0.293219 | 0.264164 | 0.272673 | 0.261987 | 0.265958 | 0.264268 | 0.264092 | 0.264209 | 0.264122 |
20008 | 0.543922 | 0.333727 | 0.363319 | 0.268517 | 0.266733 | 0.287688 | 0.256264 | 0.265529 | 0.297220 | 0.494144 | ... | 0.264127 | 0.331264 | 0.264056 | 0.296463 | 0.279677 | 0.271361 | 0.265253 | 0.264365 | 0.264831 | 0.264159 |
20009 | 0.485117 | 0.329416 | 0.334906 | 0.268958 | 0.264527 | 0.280297 | 0.299506 | 0.266346 | 0.291095 | 0.443311 | ... | 0.264120 | 0.283117 | 0.264103 | 0.269180 | 0.268408 | 0.264956 | 0.263968 | 0.263980 | 0.264310 | 0.264098 |
20010 | 0.602587 | 0.379842 | 0.355045 | 0.271216 | 0.264315 | 0.282937 | 0.351978 | 0.267386 | 0.300561 | 0.525480 | ... | 0.264118 | 0.276331 | 0.264112 | 0.261242 | 0.266583 | 0.263021 | 0.264786 | 0.263954 | 0.264307 | 0.264092 |
20011 | 0.281417 | 0.267700 | 0.271291 | 0.264485 | 0.263932 | 0.262970 | 0.264506 | 0.263964 | 0.265495 | 0.284469 | ... | 0.264124 | 0.268596 | 0.264120 | 0.266007 | 0.264448 | 0.264542 | 0.264210 | 0.264130 | 0.264178 | 0.264125 |
15616 rows × 17178 columns
test_pred = df_cf_pred.loc[20011].sort_values(ascending=False).reset_index()
test_pred = test_pred.loc[~test_pred['Anime_Id'].isin(test_input_df_original['Anime_Id'])]
test_pred.head()
Anime_Id | 20011 | |
---|---|---|
1 | 16498 | 0.308918 |
2 | 25777 | 0.306242 |
3 | 35760 | 0.306020 |
4 | 38524 | 0.305755 |
5 | 40028 | 0.305438 |
tester_eval = ModelEvaluator()
print('MRR : ', tester_eval.evaluate_mrr(test_pred, None, test_val_df, topn=10, left_on='Anime_Id'))
print('NDCG : ', tester_eval.evaluate_ndcg(test_pred, None, test_val_df, topn=10, left_on='Anime_Id'))
MRR : 0.14285714285714285
NDCG : 0.32166167872792356
# Only 2 of the top 10 predictions appear in the user's validation list, one of them merely planned to watch
test_pred.iloc[:10].merge(test_val_df, how='left')
Anime_Id | 20011 | Username | User_Id | Anime_Title | Rating_Status | Rating_Score | Num_Epi_Watched | Is_Rewatching | Updated | Start_Date | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16498 | 0.308918 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 25777 | 0.306242 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 35760 | 0.306020 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 38524 | 0.305755 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 40028 | 0.305438 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 40748 | 0.304172 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | 38000 | 0.304036 | Pynkmouth | 47.0 | Kimetsu no Yaiba | plan_to_watch | 0.0 | 0.0 | False | 2021-03-18 22:08:00+00:00 | NaN |
7 | 48583 | 0.302796 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | 47778 | 0.301383 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | 52991 | 0.301243 | Pynkmouth | 47.0 | Sousou no Frieren | completed | 6.0 | 28.0 | False | 2024-04-11 10:28:04+00:00 | NaN |
class CollaborativeRecommender:
    def __init__(self, df_cf):
        self.df_original = df_cf
        self.df_anime_id = df_cf.groupby(['Anime_Id','Anime_Title']).count().reset_index()[['Anime_Id','Anime_Title']]
    def process_df(self, df):
        # Same mean-centered interaction scoring as pivot_ratings above
        mean_df = df[df['Rating_Score']>0].groupby('User_Id')['Rating_Score'].mean().reset_index().rename(columns={'Rating_Score':'Mean_Score'})
        df = df.merge(mean_df)
        rated = df.Rating_Score > 0
        df['Interactions'] = 0.0
        df.loc[df.Rating_Score == 0, 'Interactions'] = 2
        df.loc[rated & (df.Rating_Score < df.Mean_Score), 'Interactions'] = 1
        df.loc[rated & (df.Rating_Score == df.Mean_Score), 'Interactions'] = 3
        df.loc[rated & (df.Rating_Score > df.Mean_Score), 'Interactions'] = 4
        df = df.pivot(index='User_Id', columns='Anime_Id', values='Interactions').fillna(0)
        return df
    def predict_dec(self, user_df, k=15):
        # Re-index incoming users so they do not collide with existing User_Ids
        user_df = user_df.copy()
        max_uid = self.df_original.User_Id.max()
        for i, uid in enumerate(user_df.User_Id.unique()):
            user_df.loc[user_df.User_Id==uid, 'User_Id'] = max_uid + 1 + i
        user_df = pd.concat([self.df_original, user_df])
        user_cf = self.process_df(user_df)
        sparse_cf = csr_matrix(user_cf)
        U, sigma, Vt = svds(sparse_cf, k=k)  # pass k through rather than scipy's default of 6
        return U, sigma, Vt, user_cf.columns, user_cf.index
    def predict(self, user_df, topn=10, k=15):
        # Reconstruct matrix to find similarities
        U, sigma, Vt, new_col, new_index = self.predict_dec(user_df, k)
        sigma = np.diag(sigma)
        all_ratings = np.dot(np.dot(U, sigma), Vt)
        all_ratings = (all_ratings - all_ratings.min()) / (all_ratings.max() - all_ratings.min())
        # Construct the output dataframe, collecting weights for each user we predicted on
        df_cf_pred = pd.DataFrame(all_ratings, columns=new_col, index=new_index)
        num_users = user_df.User_Id.nunique()
        res = df_cf_pred.iloc[-num_users:].T
        if num_users == 1:
            res = res.sort_values(res.columns[0], ascending=False).reset_index()
            res = res.loc[~res['Anime_Id'].isin(user_df['Anime_Id'])][:topn]
        else:
            res = res.reset_index()
            res = res.loc[~res['Anime_Id'].isin(user_df['Anime_Id'])]
        return res
As before, we have code to create an object for our Collaborative Recommender System, taking in some inputs and producing recommendations.
# 3 samples, all anime titles from dataset
df_cf = df_ratings.copy()
cf_rec = CollaborativeRecommender(df_cf)
tester_eval = ModelEvaluator()
count, total_mrr, total_ndcg = 0, 0, 0
s = datetime.now()
number_of_samples = 3
print(f"Number of Anime Titles within our dataset: {df_cf.Anime_Id.nunique()}")
for i in np.random.choice(df_ratings.User_Id.unique(), number_of_samples, replace=False):
    s_inner = datetime.now()
    count += 1
    test_input_df, test_val_df = mask_user_ratings(df_ratings[df_ratings['User_Id']==i], random_state=1)
    pred_df = cf_rec.predict(test_input_df, 10)
    total_mrr += tester_eval.evaluate_mrr(pred_df, test_input_df, test_val_df, topn=10, left_on='Anime_Id')
    total_ndcg += tester_eval.evaluate_ndcg(pred_df, test_input_df, test_val_df, topn=10, left_on='Anime_Id')
    if not count % 10:
        print(f'Time Elapsed : {datetime.now()-s}')
    print(f"Loop Number {count}, User Id {i}, Time Taken {datetime.now()-s_inner}")
print("Final MRR: ", total_mrr/count)
print("Final NDCG: ", total_ndcg/count)
Number of Anime Titles within our dataset: 17365
Loop Number 1, User Id 13830, Time Taken 0:00:08.383678
Loop Number 2, User Id 4062, Time Taken 0:00:08.232498
Loop Number 3, User Id 17416, Time Taken 0:00:08.202000
Final MRR: 0.7777777777777778
Final NDCG: 0.8000865280044508
The sanity check with 3 samples looks fine, though compared to the previous approaches this takes significantly longer to compute (roughly 8 seconds per user).
df_cf = df_ratings.copy()
cf_rec = CollaborativeRecommender(df_cf)
tester_eval = ModelEvaluator()
count, total_mrr, total_ndcg = 0, 0, 0
s = datetime.now()
number_of_samples = 1000
mrr_collab, ndcg_collab = [], []
for i in np.random.choice(df_ratings.User_Id.unique(), number_of_samples, replace=False):
s_inner = datetime.now()
count += 1
test_input_df, test_val_df = mask_user_ratings(df_ratings[df_ratings['User_Id']==i], random_state=1)
pred_df = cf_rec.predict(test_input_df, 10)
mrr = tester_eval.evaluate_mrr(pred_df, test_input_df, test_val_df, topn=10, left_on='Anime_Id')
ndcg = tester_eval.evaluate_ndcg(pred_df, test_input_df, test_val_df, topn=10, left_on='Anime_Id')
total_mrr += mrr
total_ndcg += ndcg
mrr_collab.append(mrr)
ndcg_collab.append(ndcg)
print(f'Collab MRR: {running_avg(mrr_collab)[-1]}')
print(f'Collab NDCG: {running_avg(ndcg_collab)[-1]}')
Collab MRR: 0.8626269841269844
Collab NDCG: 0.801510934536861
The performance of the collaborative approach is significantly better than the baseline and content-based recommendations, suggesting that the assumption that similar users will like similar titles holds some truth. A possible explanation for why the performance is not even better is that the SVD cannot accurately reconstruct the original matrix given the low number of singular values computed.
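If we wanted to test this, the number of singular values is exposed through the k parameter of predict() (defaulting to 15 in the class above, while scipy's svds itself defaults to 6), so a prediction could be re-run with a larger k at the cost of extra computation per user. This is left as a sketch and is not executed here:
# Hypothetical re-run of a single prediction with more singular values
pred_df = cf_rec.predict(test_input_df, topn=10, k=50)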
Below is a sanity check with 3 samples again, but with obscure titles that have fewer than 10 user ratings removed from the computation.
# Remove titles with less than 10 user ratings
df_ratings_subset = df_ratings[['Anime_Id','Anime_Title']].value_counts().reset_index()
df_ratings_subset = df_ratings_subset[df_ratings_subset['count'] >= 10]
df_cf_subset = df_ratings[df_ratings.Anime_Id.isin(df_ratings_subset.Anime_Id)]
print(f"Original number of titles: {df_ratings.Anime_Title.nunique()}")
print(f"Trimmed number of titles: {df_ratings_subset.Anime_Title.nunique()}")
Original number of titles: 17364
Trimmed number of titles: 10559
# 3 Samples, trimmed anime titles from dataset
cf_rec = CollaborativeRecommender(df_cf_subset)
tester_eval = ModelEvaluator()
count, total_mrr, total_ndcg = 0, 0, 0
s = datetime.now()
number_of_samples = 3
print(f"Number of Anime Titles with >= 10 user ratings within our dataset: {df_cf_subset.Anime_Id.nunique()}")
for i in np.random.choice(df_ratings.User_Id.unique(), number_of_samples, replace=False):
s_inner = datetime.now()
count += 1
test_input_df, test_val_df = mask_user_ratings(df_ratings[df_ratings['User_Id']==i], random_state=1)
pred_df = cf_rec.predict(test_input_df, 10)
total_mrr += tester_eval.evaluate_mrr(pred_df, test_input_df, test_val_df, topn=10, left_on='Anime_Id')
total_ndcg += tester_eval.evaluate_ndcg(pred_df, test_input_df, test_val_df, topn=10, left_on='Anime_Id')
if not count % 10:
print(f'Time Elapsed : {datetime.now()-s}')
print(f"Loop Number {count}, User Id {i}, Time Taken {datetime.now()-s_inner}")
print("Final MRR: " , total_mrr/count)
print("Final NDCG: " , total_ndcg/count)
Number of Anime Titles with >= 10 user ratings within our dataset: 10559
Loop Number 1, User Id 19564, Time Taken 0:00:06.346500
Loop Number 2, User Id 8772, Time Taken 0:00:06.117500
Loop Number 3, User Id 16816, Time Taken 0:00:06.155999
Final MRR: 0.75
Final NDCG: 0.7081818849660128
By trimming the most obscure titles from the dataset, we reduce the computation time significantly (from roughly 8 to 6 seconds per user).
6. Hybrid Model
For our hybrid approach we will combine the content-based and collaborative approaches explored above.
For simplicity's sake we will assign a weight of 0.5 to each approach.
The similarity scores from the two approaches are computed separately and standardised so that they are comparable; the final score is the weighted sum of the standardised scores.
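Concretely, for each candidate title t, the final score computed below is
Final_Score(t) = cb_weight * z_cb(t) + (1 - cb_weight) * z_cf(t), with cb_weight = 0.5
where z_cb and z_cf are the z-score standardised (StandardScaler) similarity scores from the content-based and collaborative models respectively.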
class HybridRecommender:
    def __init__(self, cb_model, cf_model, df_content, df_ratings, cb_weight=0.5):
        self.cb_model = cb_model(df_content)
        self.cf_model = cf_model(df_ratings)
        self.cb_weight = cb_weight
        self.cf_weight = 1 - cb_weight
        self.n = df_ratings.Anime_Id.nunique()
    def predict(self, user_df, topn=10):
        # Score every title with both models
        cb_pred = self.cb_model.predict(user_df, self.n)
        cf_pred = self.cf_model.predict(user_df, self.n)
        # Standardise scores from both models so that they are comparable
        ss = StandardScaler()
        cb_pred['ss'] = ss.fit_transform(cb_pred['Weights'].values.reshape(-1,1))
        cf_user_cols = cf_pred.columns[1:]  # per-user score columns (the first column is Anime_Id)
        cf_cols = ['ss_' + str(col) for col in cf_user_cols]
        cf_pred[cf_cols] = ss.fit_transform(cf_pred[cf_user_cols])
        # Blend the standardised scores using the configured weights
        combined_pred = cf_pred.merge(cb_pred[['ss','MAL_Id','Name','Score','Popularity']], how='left', left_on='Anime_Id', right_on='MAL_Id')
        combined_pred['Final_Score'] = self.cf_weight*combined_pred[cf_cols].sum(axis=1) + self.cb_weight*combined_pred['ss']
        combined_pred = combined_pred.sort_values('Final_Score', ascending=False)
        return combined_pred[:topn]
# Remove titles with less than 10 user ratings
df_ratings_subset = df_ratings[['Anime_Id','Anime_Title']].value_counts().reset_index()
df_ratings_subset = df_ratings_subset[df_ratings_subset['count'] >= 10]
df_cf_subset = df_ratings[df_ratings.Anime_Id.isin(df_ratings_subset.Anime_Id)]
tester_eval = ModelEvaluator()
hyb_rec = HybridRecommender(ContentBasedRecommender, CollaborativeRecommender, df_content, df_cf_subset, cb_weight = 0.5)
count, total_mrr, total_ndcg = 0, 0, 0
s = datetime.now()
number_of_samples = 1000
print(f"Number of Anime Titles with >= 10 user ratings within our dataset: {df_cf_subset.Anime_Id.nunique()}")
mrr_hybrid, ndcg_hybrid = [], []
for i in np.random.choice(df_ratings.User_Id.unique(), number_of_samples, replace=False):
s_inner = datetime.now()
count += 1
test_input_df, test_val_df = mask_user_ratings(df_ratings[df_ratings['User_Id']==i], random_state=1)
if len(test_input_df) == 0:
continue
pred_df = hyb_rec.predict(test_input_df, 10)
mrr = tester_eval.evaluate_mrr(pred_df, test_input_df, test_val_df, topn=10, left_on='Anime_Id')
ndcg = tester_eval.evaluate_ndcg(pred_df, test_input_df, test_val_df, topn=10, left_on='Anime_Id')
total_mrr += mrr
total_ndcg += ndcg
mrr_hybrid.append(mrr)
ndcg_hybrid.append(ndcg)
print(f'Hybrid MRR: {running_avg(mrr_hybrid)[-1]}')
print(f'Hybrid NDCG: {running_avg(ndcg_hybrid)[-1]}')
Hybrid MRR: 0.9044096472282224
Hybrid NDCG: 0.8450766763118507
# Truncate the baseline scores so every model is compared over the same number of samples
mrr_base_trunc = mrr_base[-1000:]
ndcg_base_trunc = ndcg_base[-1000:]
res = [mrr_base_trunc, ndcg_base_trunc, mrr_content, ndcg_content, mrr_collab, ndcg_collab, mrr_hybrid, ndcg_hybrid]
res = [running_avg(scores) for scores in res]
# Extract mrr and ndcg from list of all scores
mrr_res = res[::2]
ndcg_res = res[1::2]
# Plot averaged cumulative performance over time
fig, ax = plt.subplots(2,1, figsize=(8,8))
mrr_label = ['MRR_Base','MRR_Content','MRR_Collab','MRR_Hybrid']
ndcg_label = ['NDCG_Base','NDCG_Content','NDCG_Collab','NDCG_Hybrid']
for i, r in enumerate(mrr_res):
sns.lineplot(r, ax = ax[0], label = mrr_label[i])
for i, r in enumerate(ndcg_res):
sns.lineplot(r, ax = ax[1], label = ndcg_label[i])
plt.xlabel("Samples")
plt.ylabel("Cumulative Average")
plt.suptitle("Cumulative Score Plot")
ax[0].title.set_text("MRR")
ax[1].title.set_text("NDCG")
plt.show()
The performance of the hybrid approach appears to be significantly better than both of our previous approaches and the baseline. This may be because it offsets the limitations discussed in earlier sections: the content-based approach tends to stick to titles of the same kind, while the collaborative assumption that similar users like similar titles may not fully translate to actual user behaviour. The stronger performance of the hybrid approach suggests that in reality users mainly enjoy titles of a similar type while also seeking out some diversity in their interactions, and that this diversity tends to coincide with what similar users have interacted with.
final_res = [i[-1] for i in res]
final_mrr = final_res[::2]
final_ndcg = final_res[1::2]
final_results = pd.DataFrame({'MRR':final_mrr, "NDCG":final_ndcg}, index=['Baseline','Content','Collab','Hybrid'])
fig,ax = plt.subplots()
c_map = plt.cm.coolwarm.reversed()  # same colormap; avoids the deprecated get_cmap
im = ax.imshow(final_results.values, cmap=c_map)
cbar = ax.figure.colorbar(im, ax=ax)
cbar.ax.set_ylabel("Performance (Higher is better)", rotation=-90, va="bottom")
ax.set_xticks(np.arange(final_results.shape[1]), labels=final_results.columns)
ax.set_yticks(np.arange(final_results.shape[0]), labels=final_results.index)
for i in range(final_results.shape[0]):
for j in range(final_results.shape[1]):
text = ax.text(j, i, round(final_results.iloc[i, j], 3),
ha='center', va='center', color='w')
plt.tight_layout()
plt.title('Comparison of Final Performances')
Text(0.5, 1.0, 'Comparison of Final Performances')
Above is a visual summary of the final scores from this notebook, with the hybrid approach coming out on top of the baseline and the other two approaches.
7. Conclusion
In this notebook we explored different approaches to building a recommendation system and discussed some of their limitations, implications, and possible solutions. For the datasets used, we have shown that the hybrid approach performed best out of the ones tested.
Further improvements can be made to our approaches; some key ones are:
- - Utilizing the review dataset to provide additional information for our titles
- - Improving the computational performance of the approaches (see the sketch after this list)
- - Incorporating additional contextual information, such as time, to further improve the recommendations
- - Exploring more advanced techniques to calculate recommendations
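On the computational point in particular, the collaborative recommender currently re-runs svds over the full ratings matrix for every prediction, which is what drives the 6-8 seconds per user we observed. A common remedy, sketched below under the assumption that the learned item factors generalise to unseen users, is to factorise once and "fold in" each new user by projecting their interaction vector onto the precomputed item factors; fold_in_user is a hypothetical helper, not something used elsewhere in this notebook.
# One-time factorisation of the existing users x titles interaction matrix
def fit_item_factors(interaction_matrix, k=15):
    U, sigma, Vt = svds(csr_matrix(interaction_matrix), k=k)
    return Vt  # k x n_titles item factors

# Hypothetical fold-in: score a new user without refactorising the whole matrix
def fold_in_user(user_vector, Vt):
    user_factors = Vt @ user_vector  # project the interaction vector (length n_titles) onto the k factors
    return user_factors @ Vt         # estimated interaction strength for every title

# Titles already interacted with would then be filtered out and the rest ranked by score.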