
  1. Introduction
  2. Anime Info Dataset EDA
  3. Anime Reviews Dataset EDA
  4. User Ratings Dataset EDA
  5. Conclusion

1. Introduction

In this notebook we will explore the datasets scraped by our webscraping scripts. A previous notebook going through these scripts can be found here. Exploration in this notebook will be guided by some key questions within each section.

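import pandas as pd
import numpy as np
from ast import literal_eval
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import scipy.stats  # needed for scipy.stats.ttest_ind in section 4.3
import statsmodels as sm
import statsmodels.stats.weightstats  # makes sm.stats.weightstats.ztest available in section 4.3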

2. Anime Info Dataset EDA

In this section we will look at the dataset containing information for the 13,300 titles scraped, guided by the following questions:

  1. How many episodes does a title usually run for?
  2. Longest running titles?
  3. Best performing studios?
  4. Number of titles released over the years
  5. How are titles scored?

First we will look at some general information from this dataset.

cleaned_df = pd.read_csv('cleaned_anime_info.csv')
cleaned_df['Aired_Start'] = pd.to_datetime(cleaned_df['Aired_Start'], errors='coerce')
cleaned_df['Aired_End'] = pd.to_datetime(cleaned_df['Aired_End'], errors='coerce')
Here we notice missing values in the 'Episodes', 'Rating_Start', 'Aired_End', and 'Premiered_Season' columns.
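A per-column count of missing values makes this kind of observation easy to reproduce. The sketch below is only an illustration: it uses a small made-up frame (the values are invented, not taken from the dataset) that borrows a few of the column names, and the same calls could be applied to cleaned_df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cleaned_df; all values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Episodes": [12.0, np.nan, 26.0, np.nan],
    "Aired_End": pd.to_datetime(["2001-03-01", None, "2010-06-30", None]),
    "Premiered_Season": [1.0, 4.0, np.nan, 2.0],
})

# Count and share of missing values per column, keeping only affected columns.
missing = toy.isna().sum()
missing_summary = pd.DataFrame({
    "n_missing": missing,
    "share_missing": missing / len(toy),
})
missing_summary = missing_summary[missing_summary["n_missing"] > 0]
print(missing_summary)
```

Reporting the share alongside the count helps judge whether a column can still be used after dropping or imputing the gaps.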

2.1 How many episodes does a title usually run for?

sns.countplot(data=cleaned_df, x='Type')
Within "Type", TV series is the largest category. Most titles in the other categories are one-off releases with a single episode, so this section focuses on the TV category.

sns.histplot(data=cleaned_df[(cleaned_df.Type=='TV')&(cleaned_df.Episodes<=30)], x='Episodes', bins=30, binrange=(0,30))
plt.title('Distribution of No. of Episodes (TV Titles <= 30 Episodes)')


For TV series with 30 or fewer episodes, the most common run length is 12 episodes, followed by 13 and 26 episodes.

sns.histplot(data=cleaned_df[(cleaned_df.Type=='TV')&(cleaned_df.Episodes>30)&(cleaned_df.Episodes<=250)], x='Episodes', bins=22, binrange=(30,250))
plt.title('Distribution of No. of Episodes (TV Titles 30 < Episodes <= 250)')


Looking at titles with 30 < Episodes <= 250, most fall within the 50 to 60 episode range, followed by 40 to 50 and 30 to 40 episodes.

sns.histplot(data=cleaned_df[(cleaned_df.Type=='TV')&(cleaned_df.Episodes>250)], x='Episodes', bins=26, binrange=(250,3500))
plt.title('Distribution of No. of Episodes (TV Titles > 250 Episodes)')


Looking at the long-running outliers, most fall between 250 and 500 episodes, with one title running for more than 3000 episodes!
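The three histograms split TV titles at 30 and 250 episodes. As a rough sketch of how the same split could be tabulated instead of plotted, the example below buckets a handful of invented episode counts with pd.cut; the bucket labels are hypothetical names chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Invented episode counts for a handful of TV titles (not real data).
episodes = pd.Series([12, 12, 13, 24, 26, 51, 52, 110, 366, 3057, np.nan])

# Same cut points as the three histograms: <=30, 31-250, >250.
# Intervals are right-inclusive, so 30 lands in the first bucket.
buckets = pd.cut(
    episodes,
    bins=[0, 30, 250, np.inf],
    labels=["<=30", "31-250", ">250"],
)
bucket_counts = buckets.value_counts().sort_index()
print(bucket_counts)
```

Titles with a missing episode count are left out of every bucket, which matches how the histograms treat them.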

cleaned_df[cleaned_df.Episodes > 1000][['MAL_Id','Name','Episodes','Status','Popularity','Score','Rank','Aired_Start','Aired_End']]
       MAL_Id  Name                                   Episodes  Status           Popularity     Score   Rank  Aired_Start  Aired_End
904      2471  Doraemon (1979)                          1787.0  Finished Airing        2737  7.765135    905  1979-04-02   2005-03-18
6897     6277  Manga Nippon Mukashibanashi (1976)       1471.0  Finished Airing       11722  6.415709   6898  1976-01-07   1994-09-03
9028     9947  Lan Mao                                  3057.0  Finished Airing       13932  5.680556   9029  1999-10-08   2001-08-01
10202    8213  Hokahoka Kazoku                          1428.0  Finished Airing       13747  5.657343  10203  1976-10-01   1982-03-31
10271   32448  Kirin Ashita no Calendar                 1306.0  Finished Airing       15099  5.510000  10272  1980-01-01   1984-10-06
10709   22221  Monoshiri Daigaku: Ashita no Calendar    1274.0  Finished Airing       14150  5.542986  10710  1966-07-01   1970-08-02
11038   10241  Sekai Monoshiri Ryoko                    1006.0  Finished Airing       14208  5.642553  11039  1971-10-01   1974-12-31
11256   23349  Kirin Monoshiri Yakata                   1565.0  Finished Airing       14288  5.335740  11257  1975-01-01   1979-12-31
11641   12393  Oyako Club                               1818.0  Finished Airing       12922  5.518605  11642  1994-10-03   2013-03-30

Above are the titles with more than 1000 episodes. Interestingly, apart from Doraemon, none of them are particularly highly rated or popular.

2.2 Longest running titles?

Finished Airing     13197
Currently Airing      103
Name: count, dtype: int64
currently_airing = cleaned_df[cleaned_df.Status=='Currently Airing']
sns.histplot(currently_airing, x='Aired_Start')
plt.title("Currently Airing Titles' Premiere Years")


Looking at titles that are still running, we see that the majority of them premiered in the 2020s as expected. The oldest running title appears to be from around 1970, more than 50 years ago!

currently_airing[currently_airing['Aired_Start'].dt.year < 2000][['MAL_Id','Name','Episodes','Status','Score','Popularity','Rank','Aired_Start']]
      MAL_Id  Name                      Episodes  Status               Score  Popularity  Rank  Aired_Start
51        21  One Piece                      NaN  Currently Airing  8.741164          19    52  1999-10-20
405      235  Meitantei Conan                NaN  Currently Airing  8.196769         675   406  1996-01-08
997      966  Crayon Shin-chan               NaN  Currently Airing  7.840472        2313   998  1992-04-13
2602    6149  Chibi Maruko-chan (1995)       NaN  Currently Airing  7.409741        7834  2603  1995-01-08
3751    1199  Nintama Rantarou               NaN  Currently Airing  7.207432        7119  3752  1993-04-10
5304    1960  Sore Ike! Anpanman             NaN  Currently Airing  6.892672        9096  5305  1988-10-03
7489    4459  Ojarumaru                      NaN  Currently Airing  6.338290       11094  7490  1998-10-05
8878    2406  Sazae-san                      NaN  Currently Airing  6.185860        6857  8879  1969-10-05

Above are titles that started airing before 2000 and are still airing today. For these titles the Episodes field is empty, as there is no known end date or final episode count. Compared to the previous section, where we looked at finished titles with the highest number of episodes, these titles are generally more popular and higher rated, with very recognizable names like One Piece, Detective Conan (Meitantei Conan), and Crayon Shin-chan on the list.

MAL_Id                                                             2406
Name                                                          Sazae-san
Type                                                                 TV
Episodes                                                            NaN
Status                                                 Currently Airing
Producers                                                   ['Fuji TV']
Licensors                                    ['None found', 'add some']
Studios                                                       ['Eiken']
Source                                                     4-koma manga
Genres                                      ['Comedy', 'Slice of Life']
Duration                                                        24 min.
Rating                                                     G - All Ages
Score                                                           6.18586
Popularity                                                         6857
Members                                                            8369
Favorites                                                            37
Watching                                                           2171
Completed                                                             1
On-Hold                                                             741
Dropped                                                            1826
Plan to Watch                                                      3630
Total                                                              8369
Score-10                                                            248
Score-9                                                              97
Score-8                                                             180
Score-7                                                             354
Score-6                                                             353
Score-5                                                             266
Score-4                                                             101
Score-3                                                              55
Score-2                                                              48
Score-1                                                             165
Synopsis              The main character is a mother named Sazae-san...
Voice_Actors          ['Katou, Midori', 'Nagai, Ichiro', 'Sasuga, Ta...
Recommended_Ids                                 ['951', '6149', '6625']
Recommended_Counts                                      ['1', '1', '1']
Aired_Start                                         1969-10-05 00:00:00
Aired_End                                                           NaT
Premiered_Season                                                    4.0
Rank                                                               8879
Name: 8878, dtype: object

Looking at the oldest currently airing title named Sazae-san, apparently it has aired more than 8000 episodes thus far! This is more episodes than the highest number we have seen in this dataset (3057 episodes), however as the title is still airing with no planned ending, the final episode count is listed as Unknown on the website.

2.3 Best performing studios?

studios = cleaned_df.copy()
0                           ['Madhouse']
1                              ['Bones']
2                          ['White Fox']
3              ['Bandai Namco Pictures']
4                         ['Wit Studio']
13295       ['Toei Animation', 'Gallop']
13296    ['Production I.G', 'Signal.MD']
13297                            ['OLM']
13298                 ['Toei Animation']
13299         ['None found', 'add some']
Name: Studios, Length: 13300, dtype: object
# evaluate the Studios column as a list of strings
studios['Studios'] = studios['Studios'].apply(literal_eval)
# explode the Studios column so that titles with multiple studios get one row per studio
studios = studios.explode('Studios')
We see that Toei Animation has worked on the highest number of titles, while 2219 titles are missing studio information.
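To make the parse-then-explode step concrete, here is a minimal sketch on three invented rows (the title names are hypothetical). It also shows why the placeholder pair 'None found' / 'add some' needs filtering: after exploding, each placeholder becomes its own row.

```python
from ast import literal_eval

import pandas as pd

# Toy rows mimicking the raw Studios column, where each cell is a list
# serialized as a string. Title names are hypothetical.
toy = pd.DataFrame({
    "Name": ["Title A", "Title B", "Title C"],
    "Studios": [
        "['Toei Animation', 'Gallop']",
        "['Toei Animation']",
        "['None found', 'add some']",
    ],
})

# Parse the strings into real lists, then emit one row per (title, studio).
toy["Studios"] = toy["Studios"].apply(literal_eval)
exploded = toy.explode("Studios")

# The placeholder pair explodes into two separate rows, so filter both values.
placeholders = {"None found", "add some"}
known = exploded[~exploded["Studios"].isin(placeholders)]
studio_counts = known["Studios"].value_counts()
print(studio_counts)
```

Matching the placeholders exactly with isin is one option; the substring match used in this notebook works as well, as long as no real studio name contains either phrase.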

# Remove entries with unknown studios
studios = studios[(~studios.Studios.str.contains('add some|None found'))]

Distribution of Number of Titles per Studio

num_titles = studios.Studios.value_counts().reset_index()
print("Number of Studios with only 1 title : ", len(num_titles[num_titles['count'] == 1]))
print("Number of Studios with 2~5 titles : ", len(num_titles[(num_titles['count']<=5)&(num_titles['count']>=2)]))
print("Number of Studios with >5 titles : ", len(num_titles[num_titles['count'] >5] ))
Number of Studios with only 1 title :  253
Number of Studios with 2~5 titles :  285
Number of Studios with >5 titles :  272

Roughly two thirds of studios (538 of 810, about 66%) are credited with 5 or fewer titles.

ax = sns.histplot(num_titles[num_titles['count'] >5], x='count', bins=50)
ax.set(xlabel='Number of Titles', ylabel='Number of Studios')
plt.title('Distribution of Number of Titles per Studio (with >5 Titles)')


Looking at studios with >5 titles, most fall under 100 titles. However some larger/longer standing studios have more than 300 titles attributed to them.

Distribution of Average Score per Studio

avg_score = studios.groupby('Studios')['Score'].mean().reset_index()
ax = sns.histplot(avg_score, x='Score')
ax.set(xlabel='Average Score')

quantiles = ['25%','50%','75%']
colors = ['red','orange','green']
for i in range(len(quantiles)):
    a = avg_score.describe().transpose()[quantiles[i]]['Score']
    ax.axvline(a, color=colors[i], label=f'Q{i+1} : {a:.2f}')
plt.title('Distribution of Average Scores of the Studios')


The middle 50% of studios (between Q1 and Q3) have an average score between 5.80 and 6.82.
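The quartile lines in these plots come from describe(), which is recomputed on every loop iteration. As a small sketch of an alternative, the three quartiles can be obtained in one quantile call; the scores below are invented for illustration.

```python
import pandas as pd

# Invented per-studio average scores (not real data).
avg_scores = pd.Series([5.2, 5.8, 6.1, 6.3, 6.5, 6.8, 7.4, 8.0])

# Compute all three quartiles in a single call.
q1, q2, q3 = avg_scores.quantile([0.25, 0.5, 0.75])

# Share of studios whose average lies inside the interquartile range.
iqr_share = avg_scores.between(q1, q3).mean()
print(f"Q1={q1:.2f}, Q2={q2:.2f}, Q3={q3:.2f}, share within IQR={iqr_share:.2f}")
```

By construction about half of the observations sit between Q1 and Q3, which is why that range describes the middle 50% rather than a majority.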

studios_agg = studios.groupby('Studios')['Score'].agg(['count','mean']).reset_index().sort_values('count', ascending=False)
print(f"The top 10 studios have worked on {studios_agg[:10]['count'].sum() * 100 / studios_agg['count'].sum():.2f}% of all titles in the dataset")
The top 10 studios have worked on 29.96% of all titles in the dataset
fig,ax = plt.subplots(2,1)
sns.barplot(studios_agg[:10], x='Studios', y='count', ax=ax[0])
sns.barplot(studios_agg[:10], x='Studios', y='mean', ax=ax[1])
ax[0].set(ylabel='Number of Titles')
ax[0].set_ylim(bottom = studios_agg[:10]['count'].min()-20, top = studios_agg[:10]['count'].max()+20)
ax[1].set(ylabel='Mean Score')
ax[1].set_ylim(bottom = studios_agg[:10]['mean'].min()-0.5, top = studios_agg[:10]['mean'].max()+0.5)
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=45, ha='right')

quantiles = ['50%','75%']
colors = ['orange','green']
for i in range(len(quantiles)):
    a = avg_score.describe().transpose()[quantiles[i]]['Score']
    ax[1].axhline(a, color=colors[i], label=f'Q{i+2} : {a:.2f}')
fig.legend(loc='center right')
fig.suptitle('Number of Titles/Mean Scores of the Top 10 Largest Studios')


All of the top 10 studios have average scores above the industry median, and a few, such as Production I.G and A-1 Pictures, have average scores above the 75th percentile of the industry.

studios_agg.sort_values('mean', ascending=False)[:10]
# regex=False so that the '.' in the studio name is matched literally
cleaned_df[cleaned_df.Studios.str.contains('Nippon Ramayana Film Co.', regex=False)]
MAL_Id Name Type Episodes Status Producers Licensors Studios Source Genres ... Score-2 Score-1 Synopsis Voice_Actors Recommended_Ids Recommended_Counts Aired_Start Aired_End Premiered_Season Rank
208 4921 Ramayana: The Legend of Prince Rama Movie 1.0 Finished Airing ['TEM Co.', 'Ltd.'] ['None found', 'add some'] ['Nippon Ramayana Film Co.'] Other ['Adventure'] ... 21 69 Rama, the eldest prince of the Kingdom of Ayod... [] ['40834', '249'] ['1', '1'] 1993-01-15 NaT 1.0 209

1 rows × 40 columns

The studio with the highest mean score has only 1 title, as shown above. It appears to be a production studio set up specifically to produce the Ramayana movie.

studios_agg[studios_agg['count'] >= 10].sort_values('mean', ascending=False)[:10]
Among studios that have worked on >=10 titles, Motion Magic has the highest average score. Notable studios in the above list include Kyoto Animation, Wit Studio, and Bones, each with >50 titles and a high average score, which suggests a strong track record of producing well-rated titles.

2.4 Number of anime titles over the years

sns.histplot(cleaned_df.Aired_Start.dt.year, bins = 50)


From the above plot we see that the number of titles produced plummeted around the 1930s~1960s, possibly due to WW2 and its aftermath. Since then it has been increasing, with a boom since the 2000s. In recent years ~500 titles have been released per year, compared to ~100 around the year 2000.

2016.0    617
2017.0    603
2018.0    592
2014.0    583
2015.0    533
2021.0    533
2019.0    510
2022.0    494
2012.0    494
2013.0    487
Name: count, dtype: int64

The 10 years with the highest number of titles released all fall between 2012 and 2022.
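A per-year count like the one above can be produced with value_counts on the premiere year. The sketch below is an assumption about how such a table might be built, shown on a few invented dates rather than the real column.

```python
import pandas as pd

# Invented premiere dates; None mimics a date that failed to parse.
aired_start = pd.to_datetime(
    pd.Series(["2016-04-01", "2016-10-05", "2017-01-10", None, "1999-10-20"]),
    errors="coerce",
)

# With missing dates present, .dt.year typically comes back as floats,
# which would explain year labels such as 2016.0 in the output.
per_year = aired_start.dt.year.value_counts()
top_years = per_year.head(10)
print(top_years)
```

value_counts sorts by frequency and skips missing values by default, so head(10) gives the busiest years directly.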

2.5 How are titles scored?

scores = cleaned_df[cleaned_df.columns[12:32]].copy()
scores['MAL_Id'] = cleaned_df['MAL_Id'].copy()
scores['Name'] = cleaned_df['Name'].copy()
scores['Rank'] = cleaned_df['Rank'].copy()
scores['Popularity'] = cleaned_df['Popularity'].copy()
scores['Episodes'] = cleaned_df['Episodes'].copy()
5 rows × 24 columns

ax = sns.histplot(scores, x='Score')

quantiles = ['25%','50%','75%']
colors = ['red','orange','green']
for i in range(len(quantiles)):
    a = scores.describe().transpose()[quantiles[i]]['Score']
    ax.axvline(a, color=colors[i], label=f'Q{i+1} : {a:.2f}')
plt.title('Distribution of Scores Across All Titles')


Looking at the distribution of scores for all titles, the middle 50% of scores fall between 5.83 and 7.21.

The distribution suggests that users are not using the entire scale. Rather than a 1-10 rating system, it looks more like a 4-10 system with 6 or 7 being an “average” rating.

Next we can try looking at the deviation of scores to see which are the more polarizing titles according to user ratings.

def calc_sd(data):
    # Population standard deviation of a title's votes on the 1-10 scale:
    # SD = sqrt( sum(count_i * (i - mean)^2) / n )
    n = 0
    total = 0
    for i in range(1, 11):
        col = 'Score-' + str(i)
        total += data[col] * i
        n += data[col]
    xmean = total / n  # vote-weighted mean score
    xdiff = 0
    for i in range(1, 11):
        col = 'Score-' + str(i)
        xdiff += ((i - xmean) ** 2) * data[col]
    return (xdiff / n) ** 0.5
scores['SD'] = scores.apply(calc_sd, axis=1)
0        1.361661
1        1.674361
2        1.430520
3        2.009090
4        1.289965
13295    3.203730
13296    1.950239
13297    2.841099
13298    2.860655
13299    3.108680
Name: SD, Length: 13300, dtype: float64
scores[['Name','Rank','Score','SD','Popularity']].sort_values('SD', ascending = False)[:10]
       Name                            Rank     Score        SD  Popularity
13288  Shin Yaranai ka                13289  6.631579  3.615908       10908
13159  Chicken Papa                   13160  4.150509  3.381850       10311
12363  Uobbuchou                      12364  4.731707  3.325372       18258
11684  Kenda Master Ken (TV)          11685  5.150235  3.323831       13899
10645  Mahou no LumiTear              10646  7.172691  3.320603       14605
10853  Yousei Dick                    10854  5.659574  3.316795       14295
12335  Yodel no Onna                  12336  5.741093  3.316546       12113
11676  Burutabu-chan                  11677  4.396694  3.296578       18372
13015  Chargeman Ken!                 13016  4.353383  3.259139        8821
10239  Xiong Chumo: Xiong Zin Guilai  10240  4.469027  3.256228       18319

       Score-1  Score-2  Score-3  Score-4  Score-5  Score-6  Score-7  Score-8  Score-9  Score-10
13015      404      227      206      149      130       72       62       35       30       281

Titles with the largest score deviation are generally lower-scoring titles with lower popularity. Possibly because of the low number of user votes and the obscurity of these titles, their ratings are dominated by scores at both ends of the 1-10 scale, rather than clustering around the centre or one end of the scale as one would expect.
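The Score-1 to Score-10 counts shown above for Chargeman Ken! (index 13015) are enough to check its mean and SD by hand. The sketch below recomputes both from those counts with plain Python; the share of votes at the two extremes is an extra figure added here for illustration.

```python
import math

# Vote counts for scores 1..10, copied from the Chargeman Ken! row above.
counts = [404, 227, 206, 149, 130, 72, 62, 35, 30, 281]
score_values = range(1, 11)

n = sum(counts)
mean = sum(s * c for s, c in zip(score_values, counts)) / n
variance = sum(c * (s - mean) ** 2 for s, c in zip(score_values, counts)) / n
sd = math.sqrt(variance)

# Share of votes at the two extremes of the scale (scores 1 and 10).
extreme_share = (counts[0] + counts[-1]) / n
print(f"n={n}, mean={mean:.3f}, sd={sd:.3f}, extremes={extreme_share:.1%}")
```

The recomputed mean and SD agree with the Score and SD columns in the table, and a large share of the votes sits at 1 or 10, which is what drives the high deviation.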

scores[['Name','Rank','Score','SD','Popularity']].loc[scores['Rank'] < 300].sort_values('SD', ascending = False)[:10]
     Name                                         Rank     Score        SD  Popularity
9    Ginga Eiyuu Densetsu                           10  8.647633  2.177381         745
49   Ashita no Joe 2                                50  8.380418  2.172501        3045
5    Gintama: The Final                              6  8.862701  2.126664        1538
3    Gintama°                                        4  8.726812  2.009090         341
272  Blue Archive the Animation                    273  8.379562  1.946342        4056
143  Mo Dao Zu Shi: Wanjie Pian                    144  8.427876  1.929255        2377
160  Gintama: Yorinuki Gintama-san on Theater 2D   161  8.295726  1.914696        3306
139  Aria the Origination                          140  8.325882  1.849753        1725
227  Tian Guan Cifu Special                        228  8.325569  1.834084        3157
208  Ramayana: The Legend of Prince Rama           209  8.390935  1.804288        6079
Looking at only the top 300 titles, the highest standard deviation of scores is significantly lower at ~2.18. Compared to the previous title, this title appears to be widely acclaimed, with a large number of perfect scores. The variance in scores appears to be driven up by a sizeable subset of users giving it a score of 1.

scores['PauseWatchRatio'] = (scores['Dropped']+scores['On-Hold'])/(scores['Completed']+scores['Watching'])
scores[['Rank','Name','Score','Episodes','PauseWatchRatio','Popularity']].iloc[:100].sort_values('PauseWatchRatio', ascending=False)[:10]
    Rank  Name                                                  Score  Episodes  PauseWatchRatio  Popularity
14    15  Gintama                                            8.616600     201.0         0.332894         139
51    52  One Piece                                          8.741164       NaN         0.280334          19
9     10  Ginga Eiyuu Densetsu                               8.647633     110.0         0.268197         745
69    70  Mushishi                                           8.542181      26.0         0.219599         215
24    25  Monster                                            8.753179      74.0         0.209530         133
99   100  Shouwa Genroku Rakugo Shinjuu                      8.477288      13.0         0.170704         833
57    58  Great Teacher Onizuka                              8.611780      43.0         0.137601         218
3      4  Gintama°                                           8.726812      51.0         0.130167         341
45    46  Cowboy Bebop                                       8.710319      26.0         0.119984          43
50    51  Shouwa Genroku Rakugo Shinjuu: Sukeroku Futata...  8.608779      12.0         0.114479        1272

Within the top 100 titles, we compare the number of users who have dropped or paused a title against those who have completed it or are currently watching. The three titles with the highest ratio all have a high number of episodes, with ratios of roughly 0.27 to 0.33, i.e. about one user dropping or pausing for every three to four who completed or are still watching. One Piece has more than 1000 episodes as of 2024.

Even though these titles are highly rated and generally popular, it appears that the length of a series is a factor in whether a user watches a title to completion.
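PauseWatchRatio compares two groups against each other, so it is not itself the fraction of viewers who abandoned a title. As a worked example, the ratio r can be converted into a share of all four groups (dropped, on-hold, completed, watching) as r / (1 + r), using the three highest ratios from the table above.

```python
# PauseWatchRatio values copied from the table above.
ratios = {
    "Gintama": 0.332894,
    "One Piece": 0.280334,
    "Ginga Eiyuu Densetsu": 0.268197,
}

# ratio = (dropped + on_hold) / (completed + watching), so the share of
# all four groups that dropped or paused is ratio / (1 + ratio).
abandon_share = {name: r / (1 + r) for name, r in ratios.items()}

for name, share in abandon_share.items():
    print(f"{name}: {share:.1%} of engaged users dropped or paused")
```

On this reading, roughly a fifth to a quarter of the users who started these three titles have dropped or paused them.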

sns.heatmap(scores.iloc[:100].corr(numeric_only=True), mask = np.triu(scores.iloc[:100].corr(numeric_only=True)))


Looking at the heatmap above for the top 100 titles, the number of Episodes does appear to be linearly correlated with PauseWatchRatio. Another interesting observation is that a title’s average score is correlated with the number of Score-1 ratings it has received, suggesting that highly rated titles may be bombarded with Score-1 ratings for whatever reason.
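A heatmap cell can be backed up by computing the coefficient directly. The sketch below does this for Episodes against PauseWatchRatio using only the nine complete rows from the PauseWatchRatio table above. Those rows were selected for having the highest ratios, so the result is an illustration of the calculation rather than an estimate for the full top 100.

```python
import pandas as pd

# Rows copied from the PauseWatchRatio table above; One Piece is omitted
# because its Episodes value is NaN.
top = pd.DataFrame({
    "Episodes": [201, 110, 26, 74, 13, 43, 51, 26, 12],
    "PauseWatchRatio": [0.332894, 0.268197, 0.219599, 0.209530, 0.170704,
                        0.137601, 0.130167, 0.119984, 0.114479],
})

# Pearson correlation between run length and the drop/pause ratio.
r = top["Episodes"].corr(top["PauseWatchRatio"])
print(f"Pearson r = {r:.2f}")
```

Running the same call on scores.iloc[:100] would give the value that the heatmap encodes as a colour.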

sns.heatmap(cleaned_df.corr(numeric_only=True), mask = np.triu(cleaned_df.corr(numeric_only=True)), cmap='icefire')


Above we see a correlation heatmap of the variables within our dataset, with many observations that align with what one would commonly expect. For example, titles’ ratings and popularity are correlated with the number of times they get added to users’ to-watch lists.

3. Anime Reviews Dataset EDA

df = pd.read_csv('cleaned_anime_reviews.csv')
df2 = df.merge(cleaned_df[['MAL_Id','Name','Type','Episodes','Status','Source','Aired_Start','Aired_End','Rank']])
The maximum number of reviews for any title within the dataset appears to be 20, as the scraper collected only the first page of reviews for each title.

print(f"Titles with at least a full page of reviews: {(df2.groupby('MAL_Id')['review_id'].nunique() == 20).sum()} / {len(df2.groupby('MAL_Id')['review_id'].nunique())} titles")
Titles with at least a full page of reviews: 2257 / 9215 titles
# Count length of each review 
df2['review_length'] = df2.Review.apply(len)
# collate unique review per MAL id
review_len = df2.groupby('MAL_Id')['review_id'].unique().reset_index()
def sum_reviews(data, ref=df2):
    reviews_len = 0
    for num in data['review_id']:
        reviews_len += ref.loc[ref.review_id == num]['review_length'].values[0]
    return reviews_len
# Total length of all reviews per MAL Id
review_len['total_review_length'] = review_len.apply(sum_reviews, axis=1)
review_len = review_len.merge(cleaned_df[['MAL_Id','Name','Rank']], how = 'left')
review_len = review_len.merge(df2.groupby('MAL_Id')['Tags'].value_counts().unstack().reset_index(), how = 'left')
review_len.sort_values('total_review_length', ascending=False)[:10]
Hunter X Hunter (2011) has the longest reviews in the first page with a total of over 200,000 characters!
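The row-by-row lookup in sum_reviews can be slow on a large frame. As a hedged alternative, the same totals can be computed with drop_duplicates and a groupby. The sketch assumes, as the use of unique() above suggests, that a review can appear on more than one row and that review_id identifies a review across the whole table; the data is invented.

```python
import pandas as pd

# Toy reviews table; review 11 appears twice to mimic duplicated rows.
# All values are invented.
toy = pd.DataFrame({
    "MAL_Id": [1, 1, 1, 2],
    "review_id": [10, 11, 11, 12],
    "Review": ["great", "slow start", "slow start", "ok"],
})
toy["review_length"] = toy["Review"].str.len()

# Keep one row per review, then sum the lengths per title.
total_len = (
    toy.drop_duplicates("review_id")
       .groupby("MAL_Id")["review_length"]
       .sum()
       .rename("total_review_length")
       .reset_index()
)
print(total_len)
```

This avoids one boolean filter over the whole frame per review, which is where most of the time goes in the loop-based version.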

tmp = (df2.groupby('MAL_Id')['review_id'].nunique() == 20).reset_index()
tmp = tmp[tmp['review_id'] == True]['MAL_Id'].values
(2257, 13)
ax = sns.histplot(review_len[review_len['MAL_Id'].isin(tmp)], x='total_review_length')
quantiles = ['25%','50%','75%']
colors = ['red','orange','green']
for i in range(len(quantiles)):
    a = review_len[review_len['MAL_Id'].isin(tmp)].describe()['total_review_length'].transpose()[quantiles[i]]
    ax.axvline(a, color=colors[i], label=f'Q{i+1} : {a:.2f}')
plt.title('Distribution of Total Review Length (Titles with Full First Page of Reviews)')


Among titles with a full first page, the middle 50% have between 47000 and 77000 total review characters on the first page.

4. User Ratings Dataset EDA

In the final section we will be exploring the user ratings dataset guided by the following questions

  1. Average number of titles on a user’s list?
  2. Average number of lists a title is added to?
  3. Is this dataset representative of the population data?
df = pd.read_csv('cleaned_user_ratings.csv')
4.1 Average Number of titles on a user’s list?

print(f'Dataset contains {df.Username.nunique()} unique usernames')
Dataset contains 17513 unique usernames
# Rating_Score == 0 marks an unrated entry, so len(df) minus the zero count is the number of rated entries
print(f'{(len(df)-df.Rating_Score.value_counts()[0])*100/len(df):.2f}% of the entries in the dataset have been rated')
54.45% of the entries in the dataset have been rated
ax = sns.histplot(df.groupby('Username').count()[['User_Id']], x='User_Id')
quantiles = ['25%','50%','75%']
colors = ['red','orange','green']
for i in range(len(quantiles)):
    a = df.groupby('Username').count()['User_Id'].describe().transpose()[quantiles[i]]
    ax.axvline(a, color=colors[i], label=f'Q{i+1} : {a:.2f}')
plt.xlabel('Number of Entries in List')
plt.title('Distribution of Number of Titles in User List')


The maximum number of titles in a list is capped by the webscraping script, which limits each username to 500 titles. This explains the 5000+ usernames with 499 titles in their lists.

Within the dataset, the middle 50% of users have between 153 and 499 titles in their list, with the upper bound reflecting the scraping cap.

# Calculate percentage of rated entries in a user's list
df_rated = df[df.Rating_Score != 0].groupby('Username').count()[['User_Id']].reset_index()
df_all = df.groupby('Username').count()['User_Id'].reset_index()
tmp = df_rated.merge(df_all, how = 'left', on ='Username')
tmp['Percentage'] = tmp['User_Id_x']/tmp['User_Id_y']
16154 rows × 4 columns

ax = sns.histplot(tmp, x='Percentage')
quantiles = ['25%','50%','75%']
colors = ['red','orange','green']
for i in range(len(quantiles)):
    a = tmp.Percentage.describe().transpose()[quantiles[i]]
    ax.axvline(a, color=colors[i], label=f'Q{i+1} : {a:.2f}')
plt.title('Distribution of Percentage of Rated Entries in List')


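The middle 50% of users have ~39% to 82% of their list rated. More than 700 users sit in the lowest bin, having rated almost none of the titles on their list, while more than 900 users have rated every title on their list. Note that tmp is built from df_rated, so the 1359 users (17513 - 16154) with no rated entries at all are not included in this plot.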
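Starting from the rated rows and merging leaves out users who never rated anything. One way to keep them, sketched here on invented data, is to average a boolean "is rated" flag per user, which yields 0.0 for such users instead of dropping them.

```python
import pandas as pd

# Toy user lists; Rating_Score == 0 means the entry is unrated.
# Usernames and scores are invented.
toy = pd.DataFrame({
    "Username": ["ann", "ann", "bob", "bob", "cy"],
    "Rating_Score": [8, 0, 0, 0, 10],
})

# Mean of a boolean flag per user = share of that user's entries rated.
rated_share = (
    toy.assign(is_rated=toy["Rating_Score"] != 0)
       .groupby("Username")["is_rated"]
       .mean()
)
print(rated_share)
```

With this form, the number of users who rated nothing is simply (rated_share == 0).sum().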

4.2 Average number of lists a title is added to?

title_all = df.groupby('Anime_Title').count()['Anime_Id'].reset_index()
17365 rows × 2 columns

ax = sns.histplot(title_all, x='Anime_Id', log_scale=True)
quantiles = ['25%','50%','75%']
colors = ['red','orange','green']
for i in range(len(quantiles)):
    a = title_all.Anime_Id.describe().transpose()[quantiles[i]]
    ax.axvline(a, color=colors[i], label=f'Q{i+1} : {a:.2f}')
ax.set(xlabel='Number of Lists Containing Each Unique Title')
plt.title('Distribution of Number of Lists Containing Each Title')


The middle 50% of titles are added to between 4 and 130 user lists, out of around 17000 unique users in the dataset.

title_rated = df[df.Rating_Score != 0].groupby('Anime_Title').count()[['Anime_Id']].reset_index()
title_all = df.groupby('Anime_Title').count()['Anime_Id'].reset_index()
tmp = title_rated.merge(title_all, how = 'left', on ='Anime_Title')
tmp['Percentage'] = tmp['Anime_Id_x']/tmp['Anime_Id_y']
14913 rows × 4 columns

ax = sns.histplot(tmp, x='Percentage')
quantiles = ['25%','50%','75%']
colors = ['red','orange','green']
for i in range(len(quantiles)):
    a = tmp.Percentage.describe().transpose()[quantiles[i]]
    ax.axvline(a, color=colors[i], label=f'Q{i+1} : {a:.2f}')
plt.title('Distribution of Percentage of Lists in Which Each Title Is Rated')


The middle 50% of titles are rated in 39% to 64% of the lists they are added to. The plot also shows a peak at around 1.0, where about 1400 titles have an almost 100% rating rate.

tmp[tmp.Percentage > 0.9].sort_values('Anime_Id_y', ascending=False)
1397 rows × 4 columns

tmp[(tmp.Percentage > 0.9)&(tmp.Anime_Id_y == 1)]
This observation is due to almost 900 titles having been added to only 1 user list, where that single entry is also rated. The remaining ~500 titles are similarly obscure: very few users have added them to their lists, and when they do appear on a list they are usually rated.

tmp[(tmp.Percentage < 0.1)]
tmp[(tmp.Percentage < 0.1)&(tmp.Anime_Id_y > 1000)].sort_values('Anime_Id_y', ascending=False)
At the other end of the spectrum, a small number of titles have a rating rate below 0.1. Most of these appear to be highly anticipated titles that are currently airing or recently released, which users have not yet been able to complete and rate.

4.3 Is this dataset representative of the population data?

As our user data contains ratings from a subset of the total user population on the site, we want to check if our data is representative of the site’s rating data.

ax = sns.histplot(df[df.Rating_Score!=0].groupby('Anime_Title')[['Rating_Score']].mean(), x='Rating_Score')
quantiles = ['25%','50%','75%']
colors = ['red','orange','green']
for i in range(len(quantiles)):
    a = df[df.Rating_Score!=0].groupby('Anime_Title')[['Rating_Score']].mean().describe().transpose()[quantiles[i]]['Rating_Score']
    ax.axvline(a, color=colors[i], label=f'Q{i+1} : {a:.2f}')
plt.title('Distribution of Average Title Scores')


Looking at the distribution of average title scores from our user ratings data, the middle 50% of average scores fall between 5.27 and 7.29. This is fairly close to the site-wide range of 5.83 to 7.21 that we saw previously.

In this distribution we see peaks at every whole number due to obscure titles being added to very few (or a single) list and ending up with a whole number for its average score.

We can also compare the two distributions with a Student's t-test to get a more quantitative result.

scipy.stats.ttest_ind(cleaned_df['Score'].values, df[df.Rating_Score!=0].groupby('Anime_Title')['Rating_Score'].mean().values)
TtestResult(statistic=3.565581711830044, pvalue=0.0003636502206012928, df=28211.0)
sm.stats.weightstats.ztest(cleaned_df['Score'].values, df[df.Rating_Score!=0].groupby('Anime_Title')['Rating_Score'].mean().values)
(3.565581711830044, 0.0003630500223969563)

We see a t statistic of 3.57 and p ≈ 0.0004, well below 0.05, suggesting that the actual population mean is higher than our sample mean. Hence our sample mean is not representative of the population mean.
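With samples this large, even a small difference in means can produce a very small p-value, so it can help to look at an effect size next to the test statistic. The sketch below is not part of the original analysis: it is a plain-Python illustration, on invented numbers, of Welch's t statistic (which does not assume equal variances) and Cohen's d. The function name welch_t_and_cohens_d is hypothetical.

```python
import math
import statistics


def welch_t_and_cohens_d(a, b):
    """Return (Welch t statistic, Cohen's d) for two samples."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    n_a, n_b = len(a), len(b)

    # Welch's t does not assume equal variances in the two groups.
    t = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

    # Cohen's d: mean difference in units of the pooled standard deviation.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    d = (mean_a - mean_b) / math.sqrt(pooled_var)
    return t, d


# Invented scores, for illustration only.
site_scores = [6.0, 6.5, 7.0, 7.5, 8.0]
sample_scores = [5.5, 6.0, 6.5, 7.0, 7.5]
t_stat, d = welch_t_and_cohens_d(site_scores, sample_scores)
print(f"t = {t_stat:.3f}, d = {d:.3f}")
```

Applied to the two score arrays compared above, a small d alongside a significant t would indicate that the difference is statistically detectable but modest in size.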

Some possible reasons as why this might have happened:

  1. User data was scraped from users who were active at the time of scraping. If user rating behaviour has changed over time, our sample will not capture ratings from earlier periods, even though those ratings still count towards the population mean.
  2. Insufficient samples were scraped, evident from the peaks at the whole numbers. Additional user ratings may need to be scraped to obtain more samples for more obscure titles.

5. Conclusion

In this notebook we have explored the datasets that were scraped from the community site, drawing insights on content attributes, industry behaviour, and user behaviour, such as the average user effectively using only the 4-10 range of the 1-10 scale provided.

Some limitations of the data were also identified, most notably the webscraping limits placed on reviews per title and titles per user list, which result in incomplete data. The user rating data was also found not to be representative of the site’s average ratings, possibly due to insufficient data collected.

This exploration has provided some insight and intuition around these datasets, allowing us to continue with creating models from them now that we have a better understanding of the data available.