Anime webscraping - Part 3

Contents
1. Introduction
2. Seasonal Title Scraping
3. Active Users Scraping
4. Conclusion

1. Introduction
In previous notebooks we explored scraping MyAnimeList, a popular anime database and community site, for content and user rating information. After some EDA and experimentation with recommendation systems built on the scraped data, I have thought of ways to improve how the data is scraped.
Limitations of the previous implementation that we want to address here:
- Long run times, since multiple requests are required for each Title and they are sent serially.
- Sampling a small subset of random active users does not return enough rating data for less popular Titles.
Proposed solutions to the above:
- Make requests in parallel and reduce the number of requests required per Title by roughly three orders of magnitude. This is achieved by scraping pages that list information for an entire season of Titles in a single webpage, instead of scraping multiple subpages for each Title as was done previously.
- Identify Titles that we want more rating data for, and scrape users who have recently interacted with each such Title directly from its subpage.
In this notebook I will walk through the implementation of the improved approach, the goal being to improve both scraping performance and the quality of the information the scripts return. The new implementation will also let us easily update the anime content and user ratings datasets used by the recommendation system, keeping them up to date whenever a new season of titles is released.
2. Seasonal Title Scraping
import os
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
import random
import re
import csv
from multiprocessing.pool import ThreadPool as Pool
import logging
import tqdm
# Get request from site
site_url = 'https://myanimelist.net'
top_anime_url = site_url + '/anime/season/'
response = requests.get(top_anime_url + '2023/winter')
response.status_code
200
The above webpage contains summary information for all the anime titles released during the season, allowing us to scrape relevant data for a large number of titles with a single request, in contrast to sending at least one request per title.
# Extract html information from the webpage
doc = BeautifulSoup(response.text)
# Extract relevant portion of the webpage
#row_contents = doc.find_all('div', {'class':'js-anime-category-producer'})
type_contents = doc.find_all('div', {'class':'seasonal-anime-list'})
len(type_contents)
6
total = 0
for i in range(len(type_contents)):
    print('Media Type :', type_contents[i].find('div', {'class':'anime-header'}).text)
    print('Number of Titles: ', len(type_contents[i].find_all('div', {'class':'js-anime-category-producer'})))
    total += len(type_contents[i].find_all('div', {'class':'js-anime-category-producer'}))
print(f'Total Number of Titles This Season: {total}')
Media Type : TV (New)
Number of Titles: 62
Media Type : TV (Continuing)
Number of Titles: 64
Media Type : ONA
Number of Titles: 83
Media Type : OVA
Number of Titles: 11
Media Type : Movie
Number of Titles: 25
Media Type : Special
Number of Titles: 11
Total Number of Titles This Season: 256
In the above webpage for the 2023 Winter season there are 6 different media types, with a total of 256 titles contained in the single response we received.
row_contents = type_contents[0].find_all('div', {'class':'js-anime-category-producer'})
# Starting Date, Number of Episodes, Episode Duration
[x.replace(' ','') for x in row_contents[0].find('div', {'class':'prodsrc'}).text.split('\n') if x.replace(' ','') != '']
['Jan10,2023', '24eps,', '25min']
# Genre
[x.text.strip() for x in row_contents[0].findAll('span', {'class': 'genre'})]
['Action', 'Adventure', 'Drama']
# Process Number of Episodes and Episode Duration into their own dictionary
def process_prodsrc(row_content):
    content = [x.replace(' ','') for x in row_content.find('div', {'class':'prodsrc'}).text.split('\n') if x.replace(' ','') != '']
    content_dict = {'Episodes':0, 'Duration':0}
    for c in content:
        # Episode counts look like '24eps,'; the 'Sep' guard avoids matching September start dates
        if ('ep' in c or 'eps' in c) and 'Sep' not in c:
            content_dict['Episodes'] = c.replace('eps', '').replace('ep','').replace(',','')
        elif 'min' in c:
            content_dict['Duration'] = c.replace('min', '')
    return content_dict
# Studio, Source, Themes, Demographic
[x for x in row_contents[0].find('div', {'class':'properties'}).text.split('\n') if x != '']
['StudioMAPPA', 'SourceManga', 'ThemesGoreHistorical', 'DemographicSeinen']
# Process Studio, Source, Themes, Demographic into their own dictionary
def process_properties(row_content):
    content = [x for x in row_content.find('div', {'class':'properties'}).text.split('\n') if x != '']
    content_dict = {'Studio':'', 'Source': '', 'Theme': '', 'Demographic': ''}
    for c in content:
        for k in content_dict.keys():
            if k in c:
                content_dict[k] = c.replace(k, '')
    return content_dict
# Clean Synopsis text
def clean_text(text):
    text = text.replace('\t','').replace('\n',' ').replace('\r',' ')
    text = re.sub(' +', ' ', text).rstrip('\\').strip()
    return text
# Cleaned Synopsis
clean_text(row_contents[0].find('div', {'class':'synopsis'}).text)
"After his father's death and the destruction of his village at the hands of English raiders, Einar wishes for a peaceful life with his family on their newly rebuilt farms. However, fate has other plans: his village is invaded once again. Einar watches helplessly as the marauding Danes burn his lands and slaughter his family. The invaders capture Einar and take him back to Denmark as a slave. Einar clings to his mother's final words to survive. He is purchased by Ketil, a kind slave owner and landlord who promises that Einar can regain his freedom in return for working in the fields. Soon, Einar encounters his new partner in farm cultivation—Thorfinn, a dejected and melancholic slave. As Einar and Thorfinn work together toward their freedom, they are haunted by both sins of the past and the ploys of the present. Yet they carry on, grasping for a glimmer of hope, redemption, and peace in a world that is nothing but unjust and unforgiving. [Written by MAL Rewrite] StudioMAPPA SourceManga ThemesGoreHistorical DemographicSeinen"
# Function to create a dictionary containing all the above information
def extract_info(row_contents, mediatype=''):
    seasonal_contents = []
    for i in range(len(row_contents)):
        prodsrc = process_prodsrc(row_contents[i])
        properties = process_properties(row_contents[i])
        id_ = row_contents[i].find('div', {'class':'genres'})
        title_ = row_contents[i].find('span', {'class':'js-title'})
        score_ = row_contents[i].find('span', {'class':'js-score'})
        members_ = row_contents[i].find('span', {'class':'js-members'})
        start_date_ = row_contents[i].find('span', {'class':'js-start_date'})
        image_ = row_contents[i].find('img')
        contents = {
            'MAL_Id': id_.get('id', -1) if id_ else '',
            'Title': title_.text if title_ else '',
            'Image': image_.get('src','') or image_.get('data-src','') if image_ else '',
            'Score': score_.text if score_ else '',
            'Members': members_.text if members_ else '',
            'Start_Date': start_date_.text if start_date_ else '',
            'Episodes': prodsrc['Episodes'],
            'Duration': prodsrc['Duration'],
            'Genres': [x.text.strip() for x in row_contents[i].findAll('span', {'class': 'genre'})],
            'Studio': properties['Studio'],
            'Source': properties['Source'],
            'Themes': re.findall('[A-Z][^A-Z]*', properties['Theme']),
            'Demographic': re.findall('[A-Z][^A-Z]*', properties['Demographic']),
            'Synopsis': clean_text(row_contents[i].find('div', {'class':'synopsis'}).text),
            'Type': mediatype
        }
        seasonal_contents.append(contents)
    return seasonal_contents
seasonal_anime = extract_info(row_contents, 'TV')
a = pd.DataFrame(seasonal_anime)
a.head()
MAL_Id | Title | Image | Score | Members | Start_Date | Episodes | Duration | Genres | Studio | Source | Themes | Demographic | Synopsis | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49387 | Vinland Saga Season 2 | https://cdn.myanimelist.net/images/anime/1170/... | 8.81 | 608171 | 20230110 | 24 | 25 | [Action, Adventure, Drama] | MAPPA | Manga | [Gore, Historical] | [Seinen] | After his father's death and the destruction o... | TV |
1 | 52305 | Tomo-chan wa Onnanoko! | https://cdn.myanimelist.net/images/anime/1444/... | 7.79 | 392520 | 20230105 | 13 | 23 | [Comedy, Romance] | Lay-duce | Web manga | [School] | [] | Childhood friends Tomo Aizawa and Junichirou "... | TV |
2 | 50608 | Tokyo Revengers: Seiya Kessen-hen | https://cdn.myanimelist.net/images/anime/1773/... | 7.67 | 352516 | 20230108 | 13 | 23 | [Action, Drama, Supernatural] | LIDENFILMS | Manga | [Delinquents, Time , Travel] | [Shounen] | In spite of his best time-leaping efforts, Tak... | TV |
3 | 48417 | Maou Gakuin no Futekigousha II: Shijou Saikyou... | https://cdn.myanimelist.net/images/anime/1369/... | 6.90 | 336011 | 20230108 | 12 | 23 | [Action, Fantasy] | SILVER LINK. | Light novel | [Mythology, Reincarnation, School] | [] | As peace returns to the demon realm, Anos Vold... | TV |
4 | 50739 | Otonari no Tenshi-sama ni Itsunomanika Dame Ni... | https://cdn.myanimelist.net/images/anime/1240/... | 7.82 | 309405 | 20230107 | 12 | 23 | [Romance] | Project No.9 | Light novel | [School] | [] | Mahiru Shiina is worthy of her nickname "Angel... | TV |
Our function appears to be working, collecting the relevant information into a dictionary that can be easily converted into a pandas DataFrame.
Next we will define a few helper functions that we will use when scraping all the Titles available on the website, including some logging functionality to help us track and debug any issues that occur during the process.
# Helper functions
### Implement randomized sleep time in between requests to reduce chance of being blocked from site
def sleep(t=3):
    rand_t = random.random() * t + 0.5
    time.sleep(rand_t)
### Save our dictionary to a .csv file
def write_seasonal_csv(items, path):
    written_id = set()
    # Assign header names, handling seasons with no new releases in certain media types
    headers = []
    for i in range(len(items)):
        if items[i]:
            headers = list(items[i][0].keys())
            break
    # Skip writing entirely if no new titles were released
    if headers:
        # Write the headers in the first line if the file does not exist yet
        if not path in os.listdir():
            with open(path, 'w', encoding='utf-8') as f:
                f.write('|'.join(headers) + '\n')
        with open(path, 'a', encoding='utf-8') as f:
            # Write one item per line
            for i in range(len(items)):
                for item in items[i]:
                    # Skip titles that have already been added to prevent duplicated entries, some shows span multiple seasons
                    if item.get('MAL_Id') in written_id:
                        continue
                    values = []
                    for header in headers:
                        values.append(str(item.get(header, '')).replace('|',''))
                    f.write('|'.join(values) + '\n')
                    written_id.add(item.get('MAL_Id'))
### Send request to website
def get_response(url):
    # Try for up to 3 times per URL
    for _ in range(3):
        try:
            sleep(3)
            response = requests.get(url, headers=req_head)
            # If response is good we return the parsed seasonal lists for further processing
            if response.status_code == 200:
                doc = BeautifulSoup(response.text)
                row_contents = doc.find_all('div', {'class':'seasonal-anime-list'})
                if not row_contents:
                    logging.warning(f'row_contents is empty for {url}')
                    print(f'----------- row_contents is empty for {url} ------------')
                return row_contents
            # If response suggests we are rate limited, make this thread back off for a few minutes before the next attempt
            elif response.status_code == 429 or response.status_code == 405:
                logging.warning(f'{response.status_code} for {url}')
                print(f'----------- {response.status_code} occured for {url} ------------')
                buffer_t = random.random() * 40 + 160
                sleep(buffer_t)
                continue
            # Any other unexpected response
            else:
                logging.warning(f'{response.status_code} for {url}')
                print(f'----------- {response.status_code} occured for {url} ------------')
                sleep(5)
                continue
        # Any unexpected issue while sending the request
        except Exception as e:
            logging.error(f'Error trying to send request for {url}: {e}')
            buffer_t = random.random() * 40 + 100
            sleep(buffer_t)
            continue
    print("-----------------------------Error sending request-----------------------------")
    print(time.asctime())
# Instantiate variables
start_year, end_year = 1917, 2024
seasons = ['winter','spring','summer','fall']
req_head = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0',
'X-MAL-CLIENT-ID':'e09c24c7eb88c3f399d9bd1355b4e015'}
seasonal_anime_filename = 'seasonal_anime.csv'
logging.basicConfig(filename='seasonal.log', filemode='w', format='%(asctime)s - %(levelname)s - %(message)s')
# Scrape all URLs between the start and end years for the specified seasons.
# Multiple threads are used; a specific list of URLs can also be supplied to override the generated list.
def scrape(file_name, start_year=1917, end_year=2024, seasons=['winter','spring','summer','fall'], req=req_head, nprocesses=4, url_list=None):
    top_anime_url = 'https://myanimelist.net/anime/season/'
    # If specific URLs are not provided, a list of URLs is generated from the start/end years and seasons provided
    if not url_list:
        url_list = [top_anime_url + str(year) + '/' + str(season) for year in range(start_year, end_year+1) for season in seasons]
    anime_list = []
    # nprocesses threads work through the URL list in parallel
    with Pool(processes=nprocesses) as pool:
        for type_contents in tqdm.tqdm(pool.imap(get_response, url_list), total=len(url_list)):
            if type_contents is None:
                continue
            for i in range(len(type_contents)):
                row_contents = type_contents[i].find_all('div', {'class':'js-anime-category-producer'})
                mediatype = type_contents[i].find('div', {'class':'anime-header'}).text
                seasonal_contents = extract_info(row_contents, mediatype)
                anime_list.append(seasonal_contents)
            sleep(5) # a few seconds of sleep before processing the next result, to further reduce the chance of being rate limited
    # Write scraped data to disk
    write_seasonal_csv(anime_list, file_name)
    return anime_list
a = scrape(seasonal_anime_filename, start_year = 1917, end_year = 2024)
21%|████████████████▋ | 89/432 [04:48<17:23, 3.04s/it]
----------- 405 occured for https://myanimelist.net/anime/season/2000/fall ------------
----------- 405 occured for https://myanimelist.net/anime/season/2001/spring ------------
----------- 405 occured for https://myanimelist.net/anime/season/2001/winter ------------
21%|████████████████▉ | 90/432 [04:52<19:47, 3.47s/it]
----------- 405 occured for https://myanimelist.net/anime/season/2001/summer ------------
22%|█████████████████▋ | 94/432 [05:04<15:54, 2.83s/it]
----------- 405 occured for https://myanimelist.net/anime/season/2001/spring ------------
100%|████████████████████████████████████████████████████████████████████████████████| 432/432 [22:52<00:00, 3.18s/it]
#write_seasonal_csv(a,seasonal_anime_filename)
The above operation scraped all seasonal pages available on the website in just over 20 minutes, a significant improvement over the previous implementation, which took multiple hours to scrape only a subset of the Titles on the website.
In the output we see that we appear to have been rate limited a handful of times within those 20-odd minutes. Since the implemented logic retries each request up to three times and no URL failed three times in a row, I assume all requests eventually succeeded. As a sanity check, however, we will investigate these errors.
# Collect logged failed urls
with open('seasonal.log') as f:
    log_lines = f.readlines()
failed_urls = []
pattern = r'https://myanimelist.net/anime/season/[ a-zA-Z0-9./]+/[ a-zA-Z0-9./]+'
for line in log_lines:
    url = re.findall(pattern, line)
    if url:
        failed_urls.append(url[0])
len(failed_urls)
5
# Remove duplicated urls
failed_urls = list(dict.fromkeys(failed_urls))
len(failed_urls)
4
4 unique URLs faced rate-limiting errors during scraping; these are URLs for the 2001 and 2002 seasons.
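If any of those seasons had actually been missed, the url_list override on scrape() lets us retry just those pages. A minimal sketch, assuming the failed_urls list built above and writing to a separate (hypothetical) file so the main dataset is not touched:
# Re-scrape only the URLs that logged errors (sketch)
retry_results = scrape('seasonal_anime_retry.csv', url_list=failed_urls, nprocesses=2)
print(f'Re-scraped {sum(len(block) for block in retry_results)} title entries from {len(failed_urls)} URLs')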
df = pd.read_csv('seasonal_anime.csv')
df.head()
MAL_Id | Title | Image | Score | Members | Start_Date | Episodes | Duration | Genres | Studio | Source | Themes | Demographic | Synopsis | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23189 | Dekobou Shingachou: Meian no Shippai | https://cdn.myanimelist.net/images/qm_50.gif | 5.84 | 1544 | 19170200 | 1 | 5 | ['Comedy'] | Unknown | Original | [] | [] | A man first realizes he's born to be a samurai... | Movie |
1 | 17387 | Imokawa Mukuzo: Genkanban no Maki | https://cdn.myanimelist.net/images/qm_50.gif | 5.26 | 1133 | 19170100 | 1 | 8 | ['Comedy'] | Unknown | Original | [] | [] | The third professionally produced Japanese ani... | Movie |
2 | 6654 | Namakura Gatana | https://cdn.myanimelist.net/images/anime/1959/... | 5.50 | 9633 | 19170630 | 1 | 4 | ['Comedy'] | Unknown | Original | ['Samurai'] | [] | Namakura Gatana meaning "dull-edged sword " i... | Movie |
3 | 10742 | Saru to Kani no Gassen | https://cdn.myanimelist.net/images/anime/4/837... | 4.93 | 1146 | 19170520 | 1 | 6 | ['Drama'] | Unknown | Other | [] | [] | A monkey tricks a crab and steals his food. Mo... | Movie |
4 | 24575 | Yume no Jidousha | https://cdn.myanimelist.net/images/qm_50.gif | 5.62 | 623 | 19170500 | 1 | 0 | [] | Unknown | Original | ['Racing'] | [] | It is most likely a story about a great dream ... | Movie |
# Impute missing days/months in Start_Date ('00' indicates an unknown day or month)
def impute_day(date):
    # Missing day: replace the trailing '00' with '01'
    if str(date)[-2:] == '00':
        date = str(date)[:-2] + '01'
    # Missing month: replace the middle '00' with '01'
    if str(date)[4:-2] == '00':
        date = str(date)[:4] + '01' + str(date)[-2:]
    return date
df.Start_Date = df.Start_Date.apply(impute_day)
df.head()
MAL_Id | Title | Image | Score | Members | Start_Date | Episodes | Duration | Genres | Studio | Source | Themes | Demographic | Synopsis | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23189 | Dekobou Shingachou: Meian no Shippai | https://cdn.myanimelist.net/images/qm_50.gif | 5.84 | 1544 | 19170201 | 1 | 5 | ['Comedy'] | Unknown | Original | [] | [] | A man first realizes he's born to be a samurai... | Movie |
1 | 17387 | Imokawa Mukuzo: Genkanban no Maki | https://cdn.myanimelist.net/images/qm_50.gif | 5.26 | 1133 | 19170101 | 1 | 8 | ['Comedy'] | Unknown | Original | [] | [] | The third professionally produced Japanese ani... | Movie |
2 | 6654 | Namakura Gatana | https://cdn.myanimelist.net/images/anime/1959/... | 5.50 | 9633 | 19170630 | 1 | 4 | ['Comedy'] | Unknown | Original | ['Samurai'] | [] | Namakura Gatana meaning "dull-edged sword " i... | Movie |
3 | 10742 | Saru to Kani no Gassen | https://cdn.myanimelist.net/images/anime/4/837... | 4.93 | 1146 | 19170520 | 1 | 6 | ['Drama'] | Unknown | Other | [] | [] | A monkey tricks a crab and steals his food. Mo... | Movie |
4 | 24575 | Yume no Jidousha | https://cdn.myanimelist.net/images/qm_50.gif | 5.62 | 623 | 19170501 | 1 | 0 | [] | Unknown | Original | ['Racing'] | [] | It is most likely a story about a great dream ... | Movie |
df.Start_Date = pd.to_datetime(df.Start_Date, format='%Y%m%d')
df.shape
(28744, 15)
The Start_Date column in the above dataframe has been imputed and converted to datetime format, so we can now conduct our sanity checks on the years that produced errors while scraping.
df[df.Start_Date.dt.year == 2001]['Start_Date'].dt.month.value_counts()
Start_Date
10 112
4 85
7 56
12 43
8 39
3 34
5 32
1 29
6 22
11 22
2 21
9 21
Name: count, dtype: int64
df[df.Start_Date.dt.year == 2002]['Start_Date'].dt.month.value_counts()
Start_Date
4 121
10 87
1 52
8 36
11 36
3 34
7 34
12 33
5 26
9 24
2 22
6 20
Name: count, dtype: int64
We see that both 2001 and 2002 have a substantial number of titles released in every month of the year, suggesting that all four seasons were successfully scraped for both years.
3. Active Users Scraping
Now we are going to explore selectively scraping active users who have rated a Title that we want to collect more ratings for. Together with the seasonal scraping from the previous section, this gives us the option to quickly update the Recommendation System so that it takes the newest titles into consideration.
The below image shows a snippet of what this section looks like on the webpage.
response = requests.get('https://myanimelist.net/anime/49458/Kono_Subarashii_Sekai_ni_Shukufuku_wo_3/stats?show=0#members',headers=req_head)
doc = BeautifulSoup(response.text)
row_contents = doc.find_all('table', {'class':'table-recently-updated'})
# We expect to see Username , Score, Status, Eps Seen, Activity
[x.text for x in row_contents[0].findAll('tr')[1].findAll('td')]
['dawid550', '-', 'Plan to Watch', '', '36 minutes ago']
# Loop through found users and collect users that have rated the Title
res_list = []
for i in range(len(row_contents[0].findAll('tr'))):
    res = [x.text for x in row_contents[0].findAll('tr')[i].findAll('td')]
    if res[1] != '-':
        res_list.append(res)
pd.DataFrame(res_list[1:], columns=res_list[0])
Member | Score | Status | Eps Seen | Activity | |
---|---|---|---|---|---|
0 | fxl2 | 6 | Watching | \n 4 / 11\n ... | 37 minutes ago |
1 | Djimbe | 10 | Watching | \n 3 / 11\n ... | 38 minutes ago |
2 | Prettig | 10 | Watching | \n 4 / 11\n ... | 38 minutes ago |
3 | CzechAnime | 10 | Watching | \n 5 / 11\n ... | 42 minutes ago |
4 | Fajar38 | 7 | Watching | \n 4 / 11\n ... | 42 minutes ago |
5 | Naitchu | 9 | Watching | \n 4 / 11\n ... | 43 minutes ago |
6 | MishMashMoshi | 7 | Watching | \n 4 / 11\n ... | 44 minutes ago |
7 | GriffonLord | 8 | Watching | \n 4 / 11\n ... | 45 minutes ago |
8 | Danderfluff | 7 | Watching | \n 5 / 11\n ... | 46 minutes ago |
9 | MrJast | 10 | Watching | \n 11 / 11\n ... | 46 minutes ago |
10 | kawaiigabz | 9 | Watching | \n 3 / 11\n ... | 48 minutes ago |
11 | boknight | 6 | Watching | \n 1 / 11\n ... | 49 minutes ago |
12 | fabian332 | 7 | Watching | \n 4 / 11\n ... | 50 minutes ago |
13 | mkody | 8 | Watching | \n 4 / 11\n ... | 50 minutes ago |
14 | Divyansenpai69 | 8 | Watching | \n 4 / 11\n ... | 50 minutes ago |
15 | human4ever | 7 | Watching | \n 2 / 11\n ... | 55 minutes ago |
16 | Travaughn13 | 10 | Watching | \n 4 / 11\n ... | 55 minutes ago |
17 | dnilos911 | 9 | Watching | \n 4 / 11\n ... | 58 minutes ago |
18 | LouLouLouLouLou | 10 | Watching | \n - / 11\n ... | 59 minutes ago |
19 | sorotomi97 | 8 | Watching | \n 3 / 11\n ... | 1 hour ago |
20 | mrbacon56 | 10 | Watching | \n 4 / 11\n ... | 1 hour ago |
21 | Plugma | 8 | Completed | \n 11 / 11\n ... | 1 hour ago |
22 | Kitto999 | 9 | Watching | \n 2 / 11\n ... | 1 hour ago |
23 | KeerthiVasanG | 8 | Watching | \n 5 / 11\n ... | 1 hour ago |
24 | LeviAck25 | 10 | Watching | \n 1 / 11\n ... | 1 hour ago |
25 | Koukyy | 8 | Watching | \n 2 / 11\n ... | 1 hour ago |
We see that out of the 75 users in the response, 25 have rated the title! From here we can reuse our previous user scraping script to obtain these users' rating lists through the site's official API, so our system can use that information for collaborative filtering.
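As a rough sketch of that lookup (the endpoint, fields, and response shape below are assumptions based on MAL's v2 API as used in the earlier notebook, reusing the client ID already stored in req_head):
# Sketch: fetch one user's rated titles via the official API (assumed v2 endpoint)
def get_user_ratings(username, limit=100):
    url = f'https://api.myanimelist.net/v2/users/{username}/animelist'
    params = {'fields': 'list_status', 'limit': limit}
    resp = requests.get(url, params=params, headers={'X-MAL-CLIENT-ID': req_head['X-MAL-CLIENT-ID']})
    if resp.status_code != 200:
        return []
    ratings = []
    for entry in resp.json().get('data', []):
        score = entry['list_status'].get('score', 0)
        if score > 0:  # keep only titles the user has actually scored
            ratings.append({'Username': username,
                            'Anime_Id': entry['node']['id'],
                            'Rating_Score': score})
    return ratings

# Example usage on one of the usernames scraped above
# get_user_ratings('fxl2')[:3]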
Should we require additional user ratings, we can increment the show parameter in the URL to scrape further pages of recent interactions. By design the webpage returns 75 members per page, and we scraped the page starting at interaction number 0 (covering interactions 0 to 74). In our scripts we can increment this offset by 75 each time until the desired number of usernames is obtained.
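A small helper could page through the member table this way until enough rated usernames are collected. A minimal sketch, assuming the stats URL format shown above and reusing the sleep helper and req_head defined earlier:
# Sketch: collect up to n_users usernames that have rated a title,
# paging through the recent-members table 75 rows at a time.
# stats_url is the title's stats page without the show parameter, e.g.
# 'https://myanimelist.net/anime/49458/Kono_Subarashii_Sekai_ni_Shukufuku_wo_3/stats'
def get_rated_users(stats_url, n_users=50, page_size=75):
    users = []
    offset = 0
    while len(users) < n_users:
        resp = requests.get(f'{stats_url}?show={offset}', headers=req_head)
        if resp.status_code != 200:
            break
        doc = BeautifulSoup(resp.text)
        tables = doc.find_all('table', {'class':'table-recently-updated'})
        if not tables:
            break
        rows = tables[0].findAll('tr')[1:]  # skip the header row
        if not rows:
            break
        for row in rows:
            cells = [td.text for td in row.findAll('td')]
            if len(cells) >= 2 and cells[1] != '-':  # keep only users with a score
                users.append(cells[0])
        offset += page_size
        sleep(3)
    return users[:n_users]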
Next we can quickly identify the Titles for which we need to obtain additional user rating information.
user_ratings = pd.read_csv('cleaned_user_ratings.csv')
user_ratings[user_ratings.Rating_Score != 0].Anime_Id.unique()
print(f'Number of unique titles scraped : {len(df)}')
print(f'Number of unique titles rated in our user dataset : {user_ratings[user_ratings.Rating_Score != 0].Anime_Id.nunique()}')
Number of unique titles scraped : 18062
Number of unique titles rated in our user dataset : 14914
user_ratings[~user_ratings['Anime_Id'].isin(df.Id)].sort_values('Rating_Score', ascending=False)
Username | User_Id | Anime_Id | Anime_Title | Rating_Status | Rating_Score | Num_Epi_Watched | Is_Rewatching | Updated | Start_Date | |
---|---|---|---|---|---|---|---|---|---|---|
4946590 | matti_god | 18193 | 57435 | Street Fighter 6 x Spy x Family Movie: Code: W... | completed | 10 | 1 | False | 2023-12-19 15:00:22+00:00 | 2023-12-04 |
3356659 | ShiroAlex | 12378 | 52745 | Liella no Uta 2 | completed | 10 | 12 | False | 2024-04-03 10:35:45+00:00 | NaN |
531240 | potatoxslayer | 1938 | 37954 | Neo-Aspect | completed | 10 | 1 | False | 2019-01-28 05:02:56+00:00 | NaN |
535345 | 45rfew | 1953 | 32807 | Xiong Chumo | completed | 10 | 104 | False | 2022-08-01 03:22:35+00:00 | NaN |
535348 | 45rfew | 1953 | 32818 | Xiong Chumo: Huanqiu Da Maoxian | completed | 10 | 104 | False | 2022-08-01 03:22:44+00:00 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2143827 | SwopaKing | 7873 | 56906 | Isekai de Cheat Skill wo Te ni Shita Ore wa, G... | plan_to_watch | 0 | 0 | False | 2023-10-15 12:33:58+00:00 | NaN |
2143639 | M3m3supreme | 7872 | 49233 | Youjo Senki II | plan_to_watch | 0 | 0 | False | 2021-06-19 16:34:22+00:00 | NaN |
2143619 | M3m3supreme | 7872 | 34453 | Uma Musume: Pretty Derby PV | plan_to_watch | 0 | 0 | False | 2018-04-03 15:47:56+00:00 | NaN |
2143567 | M3m3supreme | 7872 | 53065 | Sono Bisque Doll wa Koi wo Suru (Zoku-hen) | plan_to_watch | 0 | 0 | False | 2022-09-18 10:22:21+00:00 | NaN |
5452183 | mintcakee | 20010 | 23057 | Yukidoke | plan_to_watch | 0 | 0 | False | 2023-10-23 10:12:55+00:00 | NaN |
93341 rows × 10 columns
print(f"Number of titles in user rating data that does not appear in scraped seasonal data : {user_ratings[~user_ratings['Anime_Id'].isin(df.Id)].Anime_Id.nunique()}")
Number of titles in user rating data that does not appear in scraped seasonal data : 3036
We see approximately 3,000 titles appearing in our user ratings dataset that do not exist within our scraped aired titles. Upon further investigation it appears that these titles are either scheduled to release in the future, or are promotional videos that do not count as a "proper show" and hence are excluded from the seasonal roster.
print(f'Number of titles within scraped seasonal data missing from user rating data : {len(df[~df.Id.isin(user_ratings[user_ratings.Rating_Score != 0].Anime_Id.unique())])}')
Number of titles within scraped seasonal data missing from user rating data : 5502
With more than 5,000 titles, an equivalent number of requests will be required to obtain the recent user interaction information. Extrapolating from the time required to scrape our seasonal data, this would mean more than 4 hours to go through all 5,000+ titles!
As an alternative, we can ignore the really obscure Titles that also have low rating scores, under the assumption that watchers would be less likely to enjoy them anyway. The likelihood of these titles being recommended is also low, as they would not rank highly during collaborative filtering due to their low scores and low number of ratings.
df[(~df.Id.isin(user_ratings[user_ratings.Rating_Score != 0].Anime_Id.unique()))].describe()
Id | Score | Members | Start_Date | Duration | |
---|---|---|---|---|---|
count | 5502.000000 | 5502.000000 | 5502.00000 | 5502 | 5502.000000 |
mean | 33578.684842 | 2.237177 | 1041.49691 | 2008-09-22 08:29:50.185387008 | 24.303708 |
min | 217.000000 | 0.000000 | 7.00000 | 1917-05-01 00:00:00 | 0.000000 |
25% | 18629.500000 | 0.000000 | 168.00000 | 2001-11-26 12:00:00 | 4.000000 |
50% | 36669.500000 | 0.000000 | 393.00000 | 2013-06-09 00:00:00 | 13.000000 |
75% | 48209.750000 | 5.670000 | 859.50000 | 2018-08-17 18:00:00 | 29.000000 |
max | 58805.000000 | 7.970000 | 132342.00000 | 2024-12-01 00:00:00 | 167.000000 |
std | 17495.517912 | 2.899072 | 3890.49838 | NaN | 27.931342 |
For demonstration purposes I will use more than 2,000 members and a score of at least 6 as our filter.
df[(~df.Id.isin(user_ratings[user_ratings.Rating_Score != 0].Anime_Id.unique())) & (df.Members > 2000) & (df.Score >= 6)]
Id | Title | Image | Score | Members | Start_Date | Episodes | Duration | Genre | Studio | Source | Themes | Demographic | Synopsis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
133 | 4948 | Shounen Sarutobi Sasuke | https://cdn.myanimelist.net/images/anime/1266/... | 6.27 | 2238 | 1959-12-25 | 1 | 82 | ['Adventure' 'Fantasy'] | Toei Animation | Original | [] | [] | Magic Boy was the first ever Japanese animatio... |
149 | 2686 | Tetsujin 28-gou | https://cdn.myanimelist.net/images/anime/8/717... | 6.94 | 3872 | 1963-10-20 | 96 | 25 | ['Adventure' 'Sci-Fi'] | Eiken | Manga | ['Mecha'] | ['Shounen'] | Dr.Haneda was developing experimental giant ro... |
209 | 3900 | Ougon Bat | https://cdn.myanimelist.net/images/anime/2/286... | 6.88 | 2816 | 1967-04-01 | 52 | 25 | ['Action' 'Sci-Fi'] | sDai-Ichi DougaDongyang Animation | Novel | ['Super ' 'Power'] | [] | A golden warrior wearing a cape and a scepter ... |
229 | 5834 | Kyojin no Hoshi | https://cdn.myanimelist.net/images/anime/12/59... | 7.47 | 3206 | 1968-03-30 | 182 | 25 | ['Drama' 'Sports'] | TMS Entertainment | Manga | ['Team ' 'Sports'] | ['Shounen'] | The story is about Hyuma Hoshi a promising yo... |
238 | 5997 | Sabu to Ichi Torimono Hikae | https://cdn.myanimelist.net/images/anime/9/840... | 7.04 | 2748 | 1968-10-03 | 52 | 25 | ['Action' 'Adventure' 'Drama' 'Slice of Life'] | Toei Animation | Manga | ['Detective' 'Historical' 'Martial ' 'Arts'... | ['Shounen'] | The series follows the adventures of Sabu a y... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17902 | 56840 | T.P BON | https://cdn.myanimelist.net/images/anime/1003/... | 6.95 | 3693 | 2024-05-02 | ? | 28 | ['Action' 'Adventure'] | Bones | Manga | ['Time ' 'Travel'] | [] | An ordinary high school student named Bon beco... |
17904 | 58689 | Yuanshen: Jinzhong Ge | https://cdn.myanimelist.net/images/anime/1229/... | 7.97 | 2176 | 2024-04-17 | 1 | 7 | ['Action' 'Drama' 'Fantasy'] | Unknown | Game | [] | [] | Animated short film about the backstory of the... |
17940 | 54866 | Blue Lock: Episode Nagi | https://cdn.myanimelist.net/images/anime/1239/... | 6.86 | 36190 | 2024-04-19 | 1 | 91 | ['Sports'] | 8bit | Manga | ['Team ' 'Sports'] | ['Shounen'] | A spin-off series of Blue Lock focusing on Sei... |
17946 | 57478 | Kuramerukagari | https://cdn.myanimelist.net/images/anime/1764/... | 6.33 | 5131 | 2024-04-12 | 1 | 61 | ['Mystery' 'Sci-Fi' 'Suspense'] | Team OneOne | Original | ['Detective'] | [] | This is a story that weaves together people an... |
17949 | 56553 | Kurayukaba | https://cdn.myanimelist.net/images/anime/1885/... | 6.48 | 4702 | 2024-04-12 | 1 | 63 | ['Mystery' 'Sci-Fi' 'Suspense'] | Team OneOne | Original | ['Detective'] | [] | Business is slow for the Ootsuji Detective Age... |
301 rows × 14 columns
This instantly cuts the number of Titles we want to scrape down to 301, which will complete in less than 15 minutes.
With this, we can expect to collect enough information to add Titles from a new season to our recommendation system within about 30 minutes: approximately 200 requests (one to obtain the seasonal titles, plus recent user interactions for the roughly 150 titles released in an average season), which takes about 10 minutes, and an additional 20 minutes to obtain the user ratings data from the official API, which in my experience is more lenient with its rate limiting.
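Putting the pieces together, a seasonal update could look roughly like the sketch below. The helper names get_rated_users and get_user_ratings refer to the sketches in the previous section, the output file name is illustrative, and the stats URL pattern with a placeholder slug is an assumption rather than a confirmed format:
# Sketch: refresh the datasets for a single new season
def update_for_season(year=2024, season='spring', members_per_title=75):
    season_url = f'https://myanimelist.net/anime/season/{year}/{season}'
    # 1. A single seasonal page (plus retries) covers every title released this season
    new_titles = scrape('seasonal_anime_update.csv', url_list=[season_url])
    # 2. For each title, collect users who recently interacted with it and gave it a score
    usernames = set()
    for media_block in new_titles:
        for title in media_block:
            stats_url = f"https://myanimelist.net/anime/{title['MAL_Id']}/_/stats"  # assumed URL pattern
            usernames.update(get_rated_users(stats_url, n_users=members_per_title))
    # 3. Pull each user's full rating list from the official API
    ratings = [r for user in usernames for r in get_user_ratings(user)]
    return pd.DataFrame(ratings)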
4. Conclusion
In this notebook we have explored an improved implementation of our previous webscraping approach for obtaining content and user rating information from the website. This implementation cuts the time taken to scrape content information by over 90%, now taking only around 20 minutes to scrape the entire site.
The targeted user approach also improves the quality of the user rating data we obtain. Previously we scraped user data indiscriminately as long as the users were recently active, resulting in many obscure titles missing from the dataset and many wasted API calls when a scraped active user did not maintain a useful rating list. The updated approach obtains usernames directly from a Title's recent-interactions list, ensuring that the scraped user ratings data will at least contain information about the Title of interest.