Contents

  1. Introduction
  2. Seasonal Title Scraping
  3. Active Users Scraping
  4. Conclusion

1. Introduction

In previous notebooks we explored scraping MyAnimeList, a popular anime database and community, for content and user rating information. After some EDA and experimentation with recommendation systems built on the scraped data, I have thought of ways to improve how the data is scraped.

Limitations of the previous implementation that we want to address here:

  1. Long run times, due to the multiple requests required for each Title being made serially.
  2. Sampling a small subset of random active users does not return enough rating data for less popular Titles.

Proposed solutions to the above:

  1. Make requests in parallel, and reduce the number of requests required per Title by roughly three orders of magnitude. This is achieved by scraping the seasonal listing pages, each of which contains information for an entire season of Titles in a single webpage, instead of scraping multiple subpages per Title as was done previously.
  2. Identify Titles that we want more rating data for, and scrape the users that have recently interacted with each Title directly from the Title's subpage.

In this notebook I will walk through the implementation of the improved approach, the goal being to improve both scraping performance and the quality of the information the scripts return. The new implementation will also let us easily update the anime content and user ratings datasets used by the recommendation system, keeping them up-to-date whenever a new season of titles is released.

2. Seasonal Title Scraping

import os
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
import random
import re
import csv
from multiprocessing.pool import ThreadPool as Pool

import logging
import tqdm
# Send a GET request to the site
site_url = 'https://myanimelist.net'
top_anime_url = site_url + '/anime/season/'
response = requests.get(top_anime_url + '2023/winter')
response.status_code
200

The above webpage contains summary information for all the anime titles released during the season, allowing us to scrape relevant data for a large number of titles with a single request, in contrast to sending at least one request per title.

[Image: the 2023 Winter seasonal anime listing page on MyAnimeList]

# Extract html information from the webpage
doc = BeautifulSoup(response.text, 'html.parser')

# Extract the relevant portion of the webpage, one block per media type
type_contents = doc.find_all('div', {'class':'seasonal-anime-list'})

len(type_contents)
6
total = 0
for content in type_contents:
    titles = content.find_all('div', {'class':'js-anime-category-producer'})
    print('Media Type :', content.find('div', {'class':'anime-header'}).text)
    print('Number of Titles: ', len(titles))
    total += len(titles)
print(f'Total Number of Titles This Season: {total}')
Media Type : TV (New)
Number of Titles:  62
Media Type : TV (Continuing)
Number of Titles:  64
Media Type : ONA
Number of Titles:  83
Media Type : OVA
Number of Titles:  11
Media Type : Movie
Number of Titles:  25
Media Type : Special
Number of Titles:  11
Total Number of Titles This Season: 256

For the 2023 Winter season, the single response we received covers 6 different media types with a total of 256 titles.

row_contents = type_contents[0].find_all('div', {'class':'js-anime-category-producer'})
# Starting Date, Number of Episodes, Episode Duration
[x.replace(' ','') for x in row_contents[0].find('div', {'class':'prodsrc'}).text.split('\n') if x.replace(' ','') != '']
['Jan10,2023', '24eps,', '25min']
# Genre
[x.text.strip() for x in row_contents[0].findAll('span', {'class': 'genre'})]
['Action', 'Adventure', 'Drama']
# Process Number of Episodes and Episode Duration into their own dictionary
def process_prodsrc(row_content):
    content = [x.replace(' ','') for x in row_content.find('div', {'class':'prodsrc'}).text.split('\n') if x.replace(' ','') != '']
    content_dict = {'Episodes': 0, 'Duration': 0}
    for c in content:
        # 'ep' marks the episode count; exclude 'Sep' so September start dates don't match
        if 'ep' in c and 'Sep' not in c:
            content_dict['Episodes'] = c.replace('eps', '').replace('ep', '').replace(',', '')
        elif 'min' in c:
            content_dict['Duration'] = c.replace('min', '')
    return content_dict
# Studio, Source, Themes, Demographic
[x for x in row_contents[0].find('div', {'class':'properties'}).text.split('\n') if x != '']
['StudioMAPPA', 'SourceManga', 'ThemesGoreHistorical', 'DemographicSeinen']
# Process Studio, Source, Themes, Demographic into their own dictionary
def process_properties(row_content):
    content = [x for x in row_content.find('div', {'class':'properties'}).text.split('\n') if x != '']
    content_dict = {'Studio':'', 'Source': '', 'Theme': '', 'Demographic': ''}
    for c in content:
        for k in content_dict.keys():
            if k in c:
                # Strip the label; any leftover (e.g. the 's' from 'Themes') is skipped later
                # by the capital-letter split in extract_info
                content_dict[k] = c.replace(k, '')
    return content_dict
# Clean Synopsis text
def clean_text(text):
    text = text.replace('\t','').replace('\n',' ').replace('\r',' ')
    text = re.sub(' +', ' ',text).rstrip('\\').strip()
    return text
# Cleaned Synopsis
clean_text(row_contents[0].find('div', {'class':'synopsis'}).text)
"After his father's death and the destruction of his village at the hands of English raiders, Einar wishes for a peaceful life with his family on their newly rebuilt farms. However, fate has other plans: his village is invaded once again. Einar watches helplessly as the marauding Danes burn his lands and slaughter his family. The invaders capture Einar and take him back to Denmark as a slave. Einar clings to his mother's final words to survive. He is purchased by Ketil, a kind slave owner and landlord who promises that Einar can regain his freedom in return for working in the fields. Soon, Einar encounters his new partner in farm cultivation—Thorfinn, a dejected and melancholic slave. As Einar and Thorfinn work together toward their freedom, they are haunted by both sins of the past and the ploys of the present. Yet they carry on, grasping for a glimmer of hope, redemption, and peace in a world that is nothing but unjust and unforgiving. [Written by MAL Rewrite] StudioMAPPA SourceManga ThemesGoreHistorical DemographicSeinen"
# Function to create a dictionary containing all the above information
def extract_info(row_contents, mediatype=''):
    seasonal_contents = []
    for i in range(len(row_contents)):
        prodsrc = process_prodsrc(row_contents[i])
        properties = process_properties(row_contents[i])
        id_ = row_contents[i].find('div', {'class':'genres'})
        title_ = row_contents[i].find('span', {'class':'js-title'})
        score_ = row_contents[i].find('span', {'class':'js-score'})
        members_ = row_contents[i].find('span', {'class':'js-members'})
        start_date_ = row_contents[i].find('span', {'class':'js-start_date'})
        image_ = row_contents[i].find('img')
        
        contents = {
            'MAL_Id': id_.get('id', -1) if id_ else '',
            'Title': title_.text if title_ else '',
            'Image': (image_.get('src','') or image_.get('data-src','')) if image_ else '',
            'Score': score_.text if score_ else '',
            'Members': members_.text if members_ else '',
            'Start_Date': start_date_.text if start_date_ else '',
            'Episodes': prodsrc['Episodes'],
            'Duration': prodsrc['Duration'],
            'Genres': [x.text.strip() for x in row_contents[i].findAll('span', {'class': 'genre'})],
            'Studio': properties['Studio'],
            'Source': properties['Source'],
            'Themes': re.findall('[A-Z][^A-Z]*', properties['Theme']),
            'Demographic': re.findall('[A-Z][^A-Z]*', properties['Demographic']),
            'Synopsis': clean_text(row_contents[i].find('div', {'class':'synopsis'}).text),
            'Type': mediatype
        }
        seasonal_contents.append(contents)
    return seasonal_contents
seasonal_anime = extract_info(row_contents, 'TV')
a = pd.DataFrame(seasonal_anime)
a.head()
MAL_Id Title Image Score Members Start_Date Episodes Duration Genres Studio Source Themes Demographic Synopsis Type
0 49387 Vinland Saga Season 2 https://cdn.myanimelist.net/images/anime/1170/... 8.81 608171 20230110 24 25 [Action, Adventure, Drama] MAPPA Manga [Gore, Historical] [Seinen] After his father's death and the destruction o... TV
1 52305 Tomo-chan wa Onnanoko! https://cdn.myanimelist.net/images/anime/1444/... 7.79 392520 20230105 13 23 [Comedy, Romance] Lay-duce Web manga [School] [] Childhood friends Tomo Aizawa and Junichirou "... TV
2 50608 Tokyo Revengers: Seiya Kessen-hen https://cdn.myanimelist.net/images/anime/1773/... 7.67 352516 20230108 13 23 [Action, Drama, Supernatural] LIDENFILMS Manga [Delinquents, Time , Travel] [Shounen] In spite of his best time-leaping efforts, Tak... TV
3 48417 Maou Gakuin no Futekigousha II: Shijou Saikyou... https://cdn.myanimelist.net/images/anime/1369/... 6.90 336011 20230108 12 23 [Action, Fantasy] SILVER LINK. Light novel [Mythology, Reincarnation, School] [] As peace returns to the demon realm, Anos Vold... TV
4 50739 Otonari no Tenshi-sama ni Itsunomanika Dame Ni... https://cdn.myanimelist.net/images/anime/1240/... 7.82 309405 20230107 12 23 [Romance] Project No.9 Light novel [School] [] Mahiru Shiina is worthy of her nickname "Angel... TV

Our function appears to be working, collecting the relevant information into a dictionary that can be easily converted into a pandas DataFrame.

Next we define a few helper functions that we will use when scraping all the Titles available on the website, including some logging functionality to help us track and debug any issues that occur during the process.

# Helper functions
### Implement randomized sleep time in between requests to reduce chance of being blocked from site
def sleep(t=3):
    rand_t = random.random() * t + 0.5
    time.sleep(rand_t)

### Save our dictionary to a .csv file
def write_seasonal_csv(items, path):
    written_id = set()
    headers = None

    # Assign header names, handling seasons with no new release in certain media types
    for i in range(len(items)):
        if items[i]:
            headers = list(items[i][0].keys())
            break

    # If no new titles were released at all, headers stays None and nothing is written
    if headers:
        # Write the header line first if the file does not exist yet
        if path not in os.listdir():
            with open(path, 'w', encoding='utf-8') as f:
                f.write('|'.join(headers) + '\n')

        with open(path, 'a', encoding='utf-8') as f:
            # Write one item per line
            for i in range(len(items)):
                for item in items[i]:
                    # Skip titles already written to prevent duplicated entries, as some shows span multiple seasons
                    if item.get('MAL_Id') in written_id:
                        continue
                    values = [str(item.get(header, '')).replace('|', '') for header in headers]
                    f.write('|'.join(values) + '\n')
                    written_id.add(item.get('MAL_Id'))

### Send request to website
def get_response(url):
    # Try for up to 3 times per URL
    for _ in range(3):
        try:
            sleep(3)
            response = requests.get(url, headers=req_head)
            # If response is good we return the parsed media-type sections for further processing
            if response.status_code == 200:
                doc = BeautifulSoup(response.text, 'html.parser')
                row_contents = doc.find_all('div', {'class':'seasonal-anime-list'})
                # find_all returns an empty list (never None) when nothing matches
                if not row_contents:
                    logging.warning(f'row_contents is empty for {url}')
                    print(f'----------- row_contents is empty for {url} ------------')
                return row_contents

            # If response suggests we are rate limited, make this thread sleep for ~3 minutes before the next attempt
            elif response.status_code == 429 or response.status_code == 405:
                logging.warning(f'{response.status_code} for {url}')
                print(f'----------- {response.status_code} occurred for {url} ------------')
                buffer_t = random.random() * 40 + 160
                time.sleep(buffer_t)  # sleep directly; our sleep() helper would re-randomize the duration
                continue

            # Any other unexpected response
            else:
                logging.warning(f'{response.status_code} for {url}')
                print(f'----------- {response.status_code} occurred for {url} ------------')
                sleep(5)
                continue

        # Any unexpected issue with sending the request
        except Exception as e:
            logging.error(f'Error trying to send request to {url}: {e}')
            buffer_t = random.random() * 40 + 100
            time.sleep(buffer_t)
            continue
    print("-----------------------------Error sending request-----------------------------")
    print(time.asctime())

    
# Instantiate variables
start_year, end_year = 1917, 2024
seasons = ['winter','spring','summer','fall']
req_head = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0',
           'X-MAL-CLIENT-ID':'e09c24c7eb88c3f399d9bd1355b4e015'}
seasonal_anime_filename = 'seasonal_anime.csv'

logging.basicConfig(filename='seasonal.log', filemode='w', format='%(asctime)s - %(levelname)s - %(message)s')
# Scrape all URLs between the start and end years for the specified seasons;
# multithreaded, with an option to override the generated URLs via a specified url_list
def scrape(file_name, start_year=1917, end_year=2024, seasons=['winter','spring','summer','fall'], req=req_head, nprocesses=4, url_list=None):
    top_anime_url = 'https://myanimelist.net/anime/season/'
    
    # If specific URLs are not provided, a list of URLs will be generated based on start/end years and seasons provided.
    if not url_list:
        url_list = [top_anime_url + str(year) + '/' + str(season) for year in range(start_year,end_year+1) for season in seasons]
    
    anime_list = []
    # nprocesses threads process the URL list in parallel
    with Pool(processes=nprocesses) as pool:
        for type_contents in tqdm.tqdm(pool.imap(get_response, url_list), total=len(url_list)):
            if type_contents is None:
                continue
            for i in range(len(type_contents)):
                row_contents = type_contents[i].find_all('div', {'class':'js-anime-category-producer'})
                mediatype = type_contents[i].find('div', {'class':'anime-header'}).text
                seasonal_contents = extract_info(row_contents, mediatype)
                anime_list.append(seasonal_contents)
            sleep(5) # a few seconds sleep before next request is sent to avoid rate limit by site
    
    # Write scraped data to disk
    write_seasonal_csv(anime_list, file_name)
    return anime_list
a = scrape(seasonal_anime_filename, start_year = 1917, end_year = 2024)
 21%|████████████████▋                                                                | 89/432 [04:48<17:23,  3.04s/it]

----------- 405 occurred for https://myanimelist.net/anime/season/2000/fall ------------
----------- 405 occurred for https://myanimelist.net/anime/season/2001/spring ------------
----------- 405 occurred for https://myanimelist.net/anime/season/2001/winter ------------


 21%|████████████████▉                                                                | 90/432 [04:52<19:47,  3.47s/it]

----------- 405 occurred for https://myanimelist.net/anime/season/2001/summer ------------


 22%|█████████████████▋                                                               | 94/432 [05:04<15:54,  2.83s/it]

----------- 405 occurred for https://myanimelist.net/anime/season/2001/spring ------------


100%|████████████████████████████████████████████████████████████████████████████████| 432/432 [22:52<00:00,  3.18s/it]

The above operation scraped every seasonal page available on the website in under 23 minutes, a significant improvement over the previous implementation, which took multiple hours just to scrape a subset of Titles.
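Because scrape accepts a url_list override, refreshing the dataset when a new season airs should only require a single request. A minimal usage sketch (the season URL here is illustrative):

# Hypothetical usage: append only the newest season to the existing csv
new_season = scrape(seasonal_anime_filename,
                    url_list=['https://myanimelist.net/anime/season/2024/spring'])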

In the output we see that we appear to have been rate limited briefly within those 23 minutes. Since the implemented logic retries each request up to three times and no error repeated three times, I assume all requests eventually succeeded. But as a sanity check, we will investigate these errors.

# Collect logged failed urls
with open('seasonal.log') as f:
    lines = f.readlines()
failed_urls = []
pattern = r'https://myanimelist.net/anime/season/[ a-zA-Z0-9./]+/[ a-zA-Z0-9./]+'
for line in lines:
    url = re.findall(pattern, line)
    if url:
        failed_urls.append(url[0])
len(failed_urls)
5
# Remove duplicated urls
failed_urls = list(dict.fromkeys(failed_urls))
len(failed_urls)
4

4 unique URLs faced rate-limiting errors during scraping; these are the URLs for the 2000 fall season and the 2001 winter, spring, and summer seasons.
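Had any page genuinely failed all three retries, we could feed the deduplicated failed_urls straight back into scrape via the same url_list override. A minimal sketch (the retry filename is an illustrative choice, kept separate to avoid duplicate rows in the main csv):

# Hypothetical usage: retry only the URLs that logged errors, writing to a separate file
retried = scrape('seasonal_anime_retry.csv', url_list=failed_urls)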

df = pd.read_csv('seasonal_anime.csv', sep='|')
df.head()
MAL_Id Title Image Score Members Start_Date Episodes Duration Genres Studio Source Themes Demographic Synopsis Type
0 23189 Dekobou Shingachou: Meian no Shippai https://cdn.myanimelist.net/images/qm_50.gif 5.84 1544 19170200 1 5 ['Comedy'] Unknown Original [] [] A man first realizes he's born to be a samurai... Movie
1 17387 Imokawa Mukuzo: Genkanban no Maki https://cdn.myanimelist.net/images/qm_50.gif 5.26 1133 19170100 1 8 ['Comedy'] Unknown Original [] [] The third professionally produced Japanese ani... Movie
2 6654 Namakura Gatana https://cdn.myanimelist.net/images/anime/1959/... 5.50 9633 19170630 1 4 ['Comedy'] Unknown Original ['Samurai'] [] Namakura Gatana meaning "dull-edged sword " i... Movie
3 10742 Saru to Kani no Gassen https://cdn.myanimelist.net/images/anime/4/837... 4.93 1146 19170520 1 6 ['Drama'] Unknown Other [] [] A monkey tricks a crab and steals his food. Mo... Movie
4 24575 Yume no Jidousha https://cdn.myanimelist.net/images/qm_50.gif 5.62 623 19170500 1 0 [] Unknown Original ['Racing'] [] It is most likely a story about a great dream ... Movie
# Impute missing days/months in Start_Date (a '00' day or month becomes '01')
def impute_day(date):
    date = str(date)
    if date[-2:] == '00':
        date = date[:-2] + '01'
    if date[4:-2] == '00':
        date = date[:4] + '01' + date[-2:]
    return date
df.Start_Date = df.Start_Date.apply(impute_day)
df.head()
MAL_Id Title Image Score Members Start_Date Episodes Duration Genres Studio Source Themes Demographic Synopsis Type
0 23189 Dekobou Shingachou: Meian no Shippai https://cdn.myanimelist.net/images/qm_50.gif 5.84 1544 19170201 1 5 ['Comedy'] Unknown Original [] [] A man first realizes he's born to be a samurai... Movie
1 17387 Imokawa Mukuzo: Genkanban no Maki https://cdn.myanimelist.net/images/qm_50.gif 5.26 1133 19170101 1 8 ['Comedy'] Unknown Original [] [] The third professionally produced Japanese ani... Movie
2 6654 Namakura Gatana https://cdn.myanimelist.net/images/anime/1959/... 5.50 9633 19170630 1 4 ['Comedy'] Unknown Original ['Samurai'] [] Namakura Gatana meaning "dull-edged sword " i... Movie
3 10742 Saru to Kani no Gassen https://cdn.myanimelist.net/images/anime/4/837... 4.93 1146 19170520 1 6 ['Drama'] Unknown Other [] [] A monkey tricks a crab and steals his food. Mo... Movie
4 24575 Yume no Jidousha https://cdn.myanimelist.net/images/qm_50.gif 5.62 623 19170501 1 0 [] Unknown Original ['Racing'] [] It is most likely a story about a great dream ... Movie
df.Start_Date = pd.to_datetime(df.Start_Date, format='%Y%m%d')
df.shape
(28744, 15)

The Start_Date column in the above dataframe has been imputed and converted to datetime format; now we can conduct our sanity checks on the years affected by the errors during scraping.

df[df.Start_Date.dt.year == 2001]['Start_Date'].dt.month.value_counts()
Start_Date
10    112
4      85
7      56
12     43
8      39
3      34
5      32
1      29
6      22
11     22
2      21
9      21
Name: count, dtype: int64
df[df.Start_Date.dt.year == 2002]['Start_Date'].dt.month.value_counts()
Start_Date
4     121
10     87
1      52
8      36
11     36
3      34
7      34
12     33
5      26
9      24
2      22
6      20
Name: count, dtype: int64

We see that both 2001 and 2002 have a large number of titles released in every month of the year, suggesting that all four seasons were successfully scraped for both years.

3. Active Users Scraping

Now we are going to explore selectively scraping active users that have rated a Title we want to collect more ratings for. Together with the seasonal scraping from the previous section, this gives us the option of quickly updating the recommendation system to take the newest titles into consideration.

The image below shows a snippet of what this section looks like on the webpage.

[Image: the recent members table on a Title's stats page]

response = requests.get('https://myanimelist.net/anime/49458/Kono_Subarashii_Sekai_ni_Shukufuku_wo_3/stats?show=0#members', headers=req_head)
doc = BeautifulSoup(response.text, 'html.parser')
row_contents = doc.find_all('table', {'class':'table-recently-updated'})

# We expect to see Username, Score, Status, Eps Seen, Activity
[x.text for x in row_contents[0].findAll('tr')[1].findAll('td')]
['dawid550', '-', 'Plan to Watch', '', '36 minutes ago']
# Loop through the found rows and collect users that have rated the Title
res_list = []
rows = row_contents[0].findAll('tr')
for row in rows:
    res = [x.text for x in row.findAll('td')]
    # Keep the header row plus any user whose score is not '-'
    if len(res) > 1 and res[1] != '-':
        res_list.append(res)
pd.DataFrame(res_list[1:], columns=res_list[0])
Member Score Status Eps Seen Activity
0 fxl2 6 Watching \n 4 / 11\n ... 37 minutes ago
1 Djimbe 10 Watching \n 3 / 11\n ... 38 minutes ago
2 Prettig 10 Watching \n 4 / 11\n ... 38 minutes ago
3 CzechAnime 10 Watching \n 5 / 11\n ... 42 minutes ago
4 Fajar38 7 Watching \n 4 / 11\n ... 42 minutes ago
5 Naitchu 9 Watching \n 4 / 11\n ... 43 minutes ago
6 MishMashMoshi 7 Watching \n 4 / 11\n ... 44 minutes ago
7 GriffonLord 8 Watching \n 4 / 11\n ... 45 minutes ago
8 Danderfluff 7 Watching \n 5 / 11\n ... 46 minutes ago
9 MrJast 10 Watching \n 11 / 11\n ... 46 minutes ago
10 kawaiigabz 9 Watching \n 3 / 11\n ... 48 minutes ago
11 boknight 6 Watching \n 1 / 11\n ... 49 minutes ago
12 fabian332 7 Watching \n 4 / 11\n ... 50 minutes ago
13 mkody 8 Watching \n 4 / 11\n ... 50 minutes ago
14 Divyansenpai69 8 Watching \n 4 / 11\n ... 50 minutes ago
15 human4ever 7 Watching \n 2 / 11\n ... 55 minutes ago
16 Travaughn13 10 Watching \n 4 / 11\n ... 55 minutes ago
17 dnilos911 9 Watching \n 4 / 11\n ... 58 minutes ago
18 LouLouLouLouLou 10 Watching \n - / 11\n ... 59 minutes ago
19 sorotomi97 8 Watching \n 3 / 11\n ... 1 hour ago
20 mrbacon56 10 Watching \n 4 / 11\n ... 1 hour ago
21 Plugma 8 Completed \n 11 / 11\n ... 1 hour ago
22 Kitto999 9 Watching \n 2 / 11\n ... 1 hour ago
23 KeerthiVasanG 8 Watching \n 5 / 11\n ... 1 hour ago
24 LeviAck25 10 Watching \n 1 / 11\n ... 1 hour ago
25 Koukyy 8 Watching \n 2 / 11\n ... 1 hour ago

We see that out of the 75 users in the response, 26 have rated the title! From here we can reuse our previous user scraping script to obtain these users' rating lists through the website's official API, so our system can use that information for collaborative filtering.
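As a rough sketch of that reuse, the official API v2 exposes each user's list at /users/{username}/animelist, authenticated with the same client-id header we defined earlier; the field selection and single-page fetch below are simplifying assumptions, not the original script.

# Hypothetical sketch: fetch a user's anime list and scores via the official API v2
def get_user_ratings(username):
    url = f'https://api.myanimelist.net/v2/users/{username}/animelist'
    params = {'fields': 'list_status', 'limit': 1000}  # list_status holds score/status/episodes watched
    response = requests.get(url, params=params,
                            headers={'X-MAL-CLIENT-ID': req_head['X-MAL-CLIENT-ID']})
    if response.status_code != 200:
        return []
    data = response.json().get('data', [])
    return [(entry['node']['id'], entry['list_status'].get('score', 0)) for entry in data]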

Should we require more user ratings, we can increment the show parameter in the URL to scrape additional pages of recent interactions. By design the webpage returns pages of 75 members, and above we scraped the page starting at interaction number 0. In our scripts we can increment the offset by 75 each time until the desired number of usernames is obtained, as sketched below.
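A minimal sketch of that pagination loop, assuming a hypothetical helper collect_rated_users and an illustrative target of 150 usernames:

# Hypothetical sketch: page through a Title's recent interactions, 75 members at a time,
# collecting users who have actually rated the Title (score != '-')
def collect_rated_users(anime_url, target=150, max_pages=10):
    rated_users = []
    for page in range(max_pages):
        offset = page * 75  # the stats page lists members in blocks of 75 via the 'show' parameter
        response = requests.get(f'{anime_url}/stats?show={offset}', headers=req_head)
        if response.status_code != 200:
            break
        doc = BeautifulSoup(response.text, 'html.parser')
        tables = doc.find_all('table', {'class':'table-recently-updated'})
        if not tables:
            break
        for row in tables[0].findAll('tr')[1:]:  # skip the header row
            res = [x.text for x in row.findAll('td')]
            if len(res) > 1 and res[1] != '-':
                rated_users.append(res[0])
        if len(rated_users) >= target:
            break
        sleep(3)  # polite randomized delay between pages, reusing our helper
    return rated_users[:target]

# Example usage (hypothetical):
# users = collect_rated_users('https://myanimelist.net/anime/49458/Kono_Subarashii_Sekai_ni_Shukufuku_wo_3')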

Next we can quickly go through the possible Titles that we need to obtain additional user ratings information from.

user_ratings = pd.read_csv('cleaned_user_ratings.csv')
print(f'Number of unique titles scraped : {len(df)}')
print(f'Number of unique titles rated in our user dataset : {user_ratings[user_ratings.Rating_Score != 0].Anime_Id.nunique()}')
Number of unique titles scraped : 18062
Number of unique titles rated in our user dataset : 14914
user_ratings[~user_ratings['Anime_Id'].isin(df.Id)].sort_values('Rating_Score', ascending=False)
Username User_Id Anime_Id Anime_Title Rating_Status Rating_Score Num_Epi_Watched Is_Rewatching Updated Start_Date
4946590 matti_god 18193 57435 Street Fighter 6 x Spy x Family Movie: Code: W... completed 10 1 False 2023-12-19 15:00:22+00:00 2023-12-04
3356659 ShiroAlex 12378 52745 Liella no Uta 2 completed 10 12 False 2024-04-03 10:35:45+00:00 NaN
531240 potatoxslayer 1938 37954 Neo-Aspect completed 10 1 False 2019-01-28 05:02:56+00:00 NaN
535345 45rfew 1953 32807 Xiong Chumo completed 10 104 False 2022-08-01 03:22:35+00:00 NaN
535348 45rfew 1953 32818 Xiong Chumo: Huanqiu Da Maoxian completed 10 104 False 2022-08-01 03:22:44+00:00 NaN
... ... ... ... ... ... ... ... ... ... ...
2143827 SwopaKing 7873 56906 Isekai de Cheat Skill wo Te ni Shita Ore wa, G... plan_to_watch 0 0 False 2023-10-15 12:33:58+00:00 NaN
2143639 M3m3supreme 7872 49233 Youjo Senki II plan_to_watch 0 0 False 2021-06-19 16:34:22+00:00 NaN
2143619 M3m3supreme 7872 34453 Uma Musume: Pretty Derby PV plan_to_watch 0 0 False 2018-04-03 15:47:56+00:00 NaN
2143567 M3m3supreme 7872 53065 Sono Bisque Doll wa Koi wo Suru (Zoku-hen) plan_to_watch 0 0 False 2022-09-18 10:22:21+00:00 NaN
5452183 mintcakee 20010 23057 Yukidoke plan_to_watch 0 0 False 2023-10-23 10:12:55+00:00 NaN

93341 rows × 10 columns

print(f"Number of titles in user rating data that does not appear in scraped seasonal data : {user_ratings[~user_ratings['Anime_Id'].isin(df.Id)].Anime_Id.nunique()}")
Number of titles in user rating data that does not appear in scraped seasonal data : 3036

We see approximately 3,000 titles appearing in our user ratings dataset that do not exist within our scraped aired titles. Upon further investigation, it appears these titles are either scheduled to release in the future, or are promotional videos that do not fall under the category of a "proper show" and hence are excluded from the seasonal roster.

print(f'Number of titles within scraped seasonal data missing from user rating data : {len(df[~df.Id.isin(user_ratings[user_ratings.Rating_Score != 0].Anime_Id.unique())])}')
Number of titles within scraped seasonal data missing from user rating data : 5502

With more than 5,000 titles, an equivalent number of requests will be required to obtain the recent user interaction information. Extrapolating from the time required to scrape our seasonal data, this would mean four to five hours to go through all 5,000+ titles!
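A quick back-of-envelope check, using the ~3.2 s per URL pace observed in the seasonal run above (432 URLs in about 23 minutes):

# 5502 titles at the ~3.2 s/request pace observed earlier
print(f'{5502 * 3.2 / 3600:.1f} hours')  # -> 4.9 hours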

As an alternative, we can ignore the really obscure Titles that also have low rating scores, under the assumption that watchers would be unlikely to enjoy them anyway. The likelihood of these titles being recommended is also low, as their low scores and low rating counts mean they would not rank highly during collaborative filtering.

df[(~df.Id.isin(user_ratings[user_ratings.Rating_Score != 0].Anime_Id.unique()))].describe()
Id Score Members Start_Date Duration
count 5502.000000 5502.000000 5502.00000 5502 5502.000000
mean 33578.684842 2.237177 1041.49691 2008-09-22 08:29:50.185387008 24.303708
min 217.000000 0.000000 7.00000 1917-05-01 00:00:00 0.000000
25% 18629.500000 0.000000 168.00000 2001-11-26 12:00:00 4.000000
50% 36669.500000 0.000000 393.00000 2013-06-09 00:00:00 13.000000
75% 48209.750000 5.670000 859.50000 2018-08-17 18:00:00 29.000000
max 58805.000000 7.970000 132342.00000 2024-12-01 00:00:00 167.000000
std 17495.517912 2.899072 3890.49838 NaN 27.931342

For demonstration purposes I will use more than 2,000 members and a score of at least 6 as our filter.

df[(~df.Id.isin(user_ratings[user_ratings.Rating_Score != 0].Anime_Id.unique())) & (df.Members > 2000) & (df.Score >= 6)]
Id Title Image Score Members Start_Date Episodes Duration Genre Studio Source Themes Demographic Synopsis
133 4948 Shounen Sarutobi Sasuke https://cdn.myanimelist.net/images/anime/1266/... 6.27 2238 1959-12-25 1 82 ['Adventure' 'Fantasy'] Toei Animation Original [] [] Magic Boy was the first ever Japanese animatio...
149 2686 Tetsujin 28-gou https://cdn.myanimelist.net/images/anime/8/717... 6.94 3872 1963-10-20 96 25 ['Adventure' 'Sci-Fi'] Eiken Manga ['Mecha'] ['Shounen'] Dr.Haneda was developing experimental giant ro...
209 3900 Ougon Bat https://cdn.myanimelist.net/images/anime/2/286... 6.88 2816 1967-04-01 52 25 ['Action' 'Sci-Fi'] sDai-Ichi DougaDongyang Animation Novel ['Super ' 'Power'] [] A golden warrior wearing a cape and a scepter ...
229 5834 Kyojin no Hoshi https://cdn.myanimelist.net/images/anime/12/59... 7.47 3206 1968-03-30 182 25 ['Drama' 'Sports'] TMS Entertainment Manga ['Team ' 'Sports'] ['Shounen'] The story is about Hyuma Hoshi a promising yo...
238 5997 Sabu to Ichi Torimono Hikae https://cdn.myanimelist.net/images/anime/9/840... 7.04 2748 1968-10-03 52 25 ['Action' 'Adventure' 'Drama' 'Slice of Life'] Toei Animation Manga ['Detective' 'Historical' 'Martial ' 'Arts'... ['Shounen'] The series follows the adventures of Sabu a y...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17902 56840 T.P BON https://cdn.myanimelist.net/images/anime/1003/... 6.95 3693 2024-05-02 ? 28 ['Action' 'Adventure'] Bones Manga ['Time ' 'Travel'] [] An ordinary high school student named Bon beco...
17904 58689 Yuanshen: Jinzhong Ge https://cdn.myanimelist.net/images/anime/1229/... 7.97 2176 2024-04-17 1 7 ['Action' 'Drama' 'Fantasy'] Unknown Game [] [] Animated short film about the backstory of the...
17940 54866 Blue Lock: Episode Nagi https://cdn.myanimelist.net/images/anime/1239/... 6.86 36190 2024-04-19 1 91 ['Sports'] 8bit Manga ['Team ' 'Sports'] ['Shounen'] A spin-off series of Blue Lock focusing on Sei...
17946 57478 Kuramerukagari https://cdn.myanimelist.net/images/anime/1764/... 6.33 5131 2024-04-12 1 61 ['Mystery' 'Sci-Fi' 'Suspense'] Team OneOne Original ['Detective'] [] This is a story that weaves together people an...
17949 56553 Kurayukaba https://cdn.myanimelist.net/images/anime/1885/... 6.48 4702 2024-04-12 1 63 ['Mystery' 'Sci-Fi' 'Suspense'] Team OneOne Original ['Detective'] [] Business is slow for the Ootsuji Detective Age...

301 rows × 14 columns

This instantly cuts the number of Titles we want to scrape down to 301, which will complete in less than 15 minutes.

With this, we can expect to collect enough information to add Titles from a new season to our recommendation system within 30 minutes: approximately 200 requests (one to obtain the seasonal titles, plus recent user interactions for the roughly 150 titles released in an average season), which would take about 10 minutes, and an additional 20 minutes to obtain the user ratings data from the official API, which in my experience is more lenient with its rate limiting.

4. Conclusion

In this notebook we have explored an improved implementation of our previous webscraping approach to obtaining content and user ratings information from the website. The new implementation cuts the time taken to scrape content information by over 90%, taking only about 23 minutes to scrape the entire site.

The targeted user approach improves the quality of the user rating data we obtain. Previously we scraped user data indiscriminately, as long as the user was recently active, resulting in many obscure titles missing from the dataset and many wasted API calls whenever a scraped user did not maintain a useful rating list. The updated approach obtains usernames directly from a Title's recent interactions list, ensuring that the scraped user ratings data will at least contain information on the Title of interest.