Anime webscraping - Part 2

1. Introduction
In this notebook we scrape MyAnimeList, a popular anime database and community site, aiming to collect enough raw data from the anime titles available on the website for further processing and learning purposes.
In Part 1 we scraped datasets containing information on anime titles and their reviews by sending HTTP requests to the site and parsing the returned HTML pages. Here in Part 2 we will still be sending requests, but we will also make use of the website's official API.
The relevant Python scripts and samples of the datasets can be found in the following repository.
import os
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
import numpy as np
import random
import csv
2. Username Scraping
As there is no easy way to obtain a list of users on the site, a few different approaches were considered.
- Scrape anime titles’ webpages and retrieve lists of usernames that have recently added the title to their rating lists.
- Scrape discussion threads and retrieve all usernames that have participated in the threads.
- Scrape the users page periodically to gather a list of usernames that have been active recently.
We are proceeding with the third approach due to the following considerations:
- Approach 3 gives us users that are guaranteed to be recently active; the other approaches may return inactive users depending on the age of the discussion threads or the popularity of an anime title.
- Approaches 1 and 2 may introduce more bias into the sample of usernames we scrape. For Approach 1 we would have to consider how we sample the anime titles we choose to scrape, to avoid oversampling users who seek out specific genres. Approach 2 has the same issue with the sampling of discussion threads, and it also excludes all users who do not take part in these threads.
The image below shows the user page that we will scrape. Each request should return a response listing 20 recently active users, so by repeating the request periodically we can gather a large enough pool of usernames over time.
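As a quick check before building the full scraper, we can send a single request to the users page and count how many usernames come back. The sketch below mirrors the extraction logic used later; the User-Agent string is just an example desktop header.
# Quick sanity check: one request to the users page should list ~20 recently active users
import requests
from bs4 import BeautifulSoup

req_head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0'}
resp = requests.get('https://myanimelist.net/users.php', headers=req_head)
doc = BeautifulSoup(resp.text, 'html.parser')

# Recently active users sit in 'td.borderClass' cells (same selector used further below)
names = [c.find('div').text.strip() for c in doc.find_all('td', class_='borderClass') if c.find('div')]
print(resp.status_code, len(names))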
### Implement randomized sleep time in between requests to reduce chance of being blocked from site
def sleep(t):
    # Sleep for a random duration between t and 2t seconds
    rand_t = random.random() * t + t
    print(f"Sleeping for {rand_t}s")
    time.sleep(rand_t)
# Helper function to write list values to csv as a new row
def write_new_row(file_name, l):
    with open(file_name, 'a', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|', lineterminator='\n')
        for v in l:
            writer.writerow([v])
# Function to scrape usernames page
def get_data(link, req_head):
    for _ in range(3):
        try:
            sleep(0.5)
            data = requests.get(link, headers=req_head)
            if data.status_code == 403:
                print('-----------------------------403 error encountered, may have been rate limited or user list is restricted-----------------------------')
                return None
            elif data.status_code != 200:
                print(f'-----------------------------{data.status_code} status code encountered-----------------------------')
                sleep(5)
                continue
            else:
                return data
        except requests.exceptions.RequestException:
            # Back off for a few minutes before retrying if the request itself fails
            buffer_t = random.random() * 40 + 100
            sleep(buffer_t)
            continue
    print("-----------------------------Error getting request-----------------------------")
    print(time.asctime())
Similar to Part 1, above are some fairly self-explanatory helper functions that will be used throughout.
# Function to extract username data from page
def extract_usernames(data, current_set):
    doc = BeautifulSoup(data.text, 'html.parser')
    usernames = []
    for d in doc.find_all('td', class_='borderClass'):
        username = d.find('div').text
        if username not in current_set:
            usernames.append(username)
    return usernames
# Scrape usernames from the user page periodically
def scrape_users(req_head, file_name='usernames_list.csv', target=20000):
    # Resume from any usernames already saved to file
    if os.path.exists(file_name):
        current_set = set(pd.read_csv(file_name, delimiter='|', header=None).values.ravel())
    else:
        current_set = set()
    i = len(current_set)
    while i < target:
        data = get_data('https://myanimelist.net/users.php', req_head)
        if data is None:
            continue
        usernames = extract_usernames(data, current_set)
        current_set.update(usernames)
        write_new_row(file_name, usernames)
        i = len(current_set)
        print(f'Current number of usernames found: {i}')
Above we see the functions used for scraping the users page. The first function extracts the usernames from the response HTML, while the second contains the overall logic to periodically scrape the page until enough unique usernames have been collected. A set is used to track the unique usernames found so far.
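For reference, a scraping run could then be started as in the sketch below; the User-Agent value is just an example header and the target count can be adjusted as needed.
# Example desktop header (any reasonable User-Agent should work)
req_head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0'}

# Collect roughly 20,000 unique usernames into usernames.csv, resuming from any
# usernames already saved in that file
scrape_users(req_head, file_name='usernames.csv', target=20000)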
usernames = pd.read_csv('usernames.csv', delimiter='|', header=None).iloc[:,0]
usernames
0 flerbz
1 ArnauCP
2 KelvinKanbaru
3 vian11231123
4 Cidkagenou1997
...
20006 Dolphtaro
20007 AsUkA234
20008 Josetrox
20009 grabahan
20010 mintcakee
Name: 0, Length: 20011, dtype: object
Above we see that we have successfully scraped just over 20,000 unique, recently active usernames.
3. User Ratings List Scraping
With the usernames available, we can scrape each user's anime ratings list for the desired information. Below is a screenshot of what such a list may look like.
# Extract json details into a list of dict
def get_anime_list(data, user_name, pos, ratings_list):
    # Each element of the response's 'data' array holds the anime ('node') and the
    # user's list entry for it ('list_status')
    for entry in data.json()['data']:
        rating_entry = {
            "Username" : user_name,
            "User_Id" : pos,
            "Anime_Id" : entry['node'].get('id', np.nan),
            "Anime_Title" : entry['node'].get('title', np.nan),
            "Rating_Status" : entry['list_status'].get('status', np.nan),
            "Rating_Score" : entry['list_status'].get('score', np.nan),
            "Num_Epi_Watched" : entry['list_status'].get('num_episodes_watched', np.nan),
            "Is_Rewatching" : entry['list_status'].get('is_rewatching', np.nan),
            "Updated" : entry['list_status'].get('updated_at', np.nan),
            "Start_Date" : entry['list_status'].get('start_date', np.nan)
        }
        ratings_list.append(rating_entry)
    return ratings_list
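For context, each element of the `data` array returned by the API has roughly the shape sketched below (the values are copied from one of the scraped rows purely for illustration, and the real response may include additional fields); this is what `get_anime_list` flattens into one dictionary per title.
# Illustrative shape of one element of data.json()['data'] (values are examples only)
example_entry = {
    'node': {'id': 30654, 'title': 'Ansatsu Kyoushitsu 2nd Season'},
    'list_status': {
        'status': 'watching',
        'score': 0,
        'num_episodes_watched': 24,
        'is_rewatching': False,
        'updated_at': '2022-02-26T22:15:01+00:00',
        'start_date': '2022-01-29'
    }
}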
# Helper function to write dict to csv as a new row
def write_new_row_dict(file_name, d):
    # Create the file with a header row (taken from the dict keys) if it does not exist yet
    if not os.path.exists(file_name):
        with open(file_name, 'w', encoding='utf-8') as f:
            writer = csv.writer(f, delimiter='|', lineterminator='\n')
            writer.writerow(list(d[0].keys()))
    with open(file_name, 'a', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|', lineterminator='\n')
        for row in d:
            writer.writerow([str(v) for v in row.values()])
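As a small usage example (with made-up values), logging a skipped user would look like the line below; the first call creates the file together with a header row derived from the dictionary keys, and later calls only append data rows.
# Hypothetical example: on the first call skipped_users_list.csv is created with a
# 'pos|username' header row, after which only data rows are appended
write_new_row_dict('skipped_users_list.csv', [{'pos': 42, 'username': 'example_user'}])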
# Scrape anime list information of each username within the list of usernames
def scrape_user_animelist(usernames, req_head, pos, log_file, output_file):
    curr = 0  # track consecutive skipped users, to differentiate rate limiting vs a user's restricted list
    while pos < len(usernames):
        # If a 403 error is encountered more than 3 times in a row, pause for a few minutes
        # due to suspected rate limiting
        if curr > 3:
            print("Suspected rate limiting, pausing for a few minutes")
            sleep(240)
            curr = 0
        username = usernames[pos]
        animelist_link = f'https://api.myanimelist.net/v2/users/{username}/animelist?limit=500&nsfw=true&fields=list_status'
        ratings_list = []
        data = get_data(animelist_link, req_head)
        # Log the users that were skipped due to a 403 error; this can happen if the website
        # rate limits us or if the user has chosen to keep their list private/restricted
        if data is None:
            print(f'Current number of usernames processed: {pos} / {len(usernames)}')
            print(f'Skipping user {pos} as rate limited or user list is restricted')
            write_new_row_dict(log_file, [{'pos': pos, 'username': username}])
            curr += 1
            pos += 1
            continue
        curr = 0
        ratings_list = get_anime_list(data, username, pos, ratings_list)
        if len(ratings_list):
            write_new_row_dict(output_file, ratings_list)
        print(f'Current number of usernames processed: {pos} / {len(usernames)}')
        pos += 1
Once we have a list of usernames, we call the website's official API to retrieve the anime ratings list of each of these users. Similar to how we did it for anime titles, the information for each user is stored in a list of dictionaries before being written to file.
When requesting the list of a user who has restricted access to their anime ratings, the API returns a 403 status code. This is the same response we receive if we are rate limited when using the API. To prevent the script from pausing unnecessarily when encountering restricted users, the time delay that lets any rate limit expire is only triggered after a 403 response has been encountered 4 times in a row. These 403 errors are also logged for further investigation if required.
With the official API, retrieving data is a lot simpler, as everything is properly structured within the returned JSON; this avoids having to make sense of the webpage's HTML and identify which tags to retrieve.
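For completeness, a run over the full username list might look like the sketch below. The official API expects some form of authentication; commonly this is an API client ID registered with MyAnimeList and sent in the `X-MAL-CLIENT-ID` header (the value shown is a placeholder), with an OAuth access token being the alternative.
# Placeholder client ID; register an API client with MyAnimeList to obtain a real one
api_head = {'X-MAL-CLIENT-ID': 'your_client_id_here'}

usernames = pd.read_csv('usernames.csv', delimiter='|', header=None).iloc[:, 0]

# Start from the first username; skipped users and ratings are written to the given files
scrape_user_animelist(usernames, api_head, pos=0,
                      log_file='skipped_users_list.csv',
                      output_file='user_ratings.csv')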
user_ratings = pd.read_csv('user_ratings.csv', delimiter='|')
user_ratings
 | Username | User_Id | Anime_Id | Anime_Title | Rating_Status | Rating_Score | Num_Epi_Watched | Is_Rewatching | Updated | Start_Date |
---|---|---|---|---|---|---|---|---|---|---|
0 | flerbz | 0 | 30654 | Ansatsu Kyoushitsu 2nd Season | watching | 0 | 24 | False | 2022-02-26T22:15:01+00:00 | 2022-01-29 |
1 | flerbz | 0 | 22789 | Barakamon | dropped | 0 | 2 | False | 2023-01-28T19:03:33+00:00 | 2022-04-06 |
2 | flerbz | 0 | 31964 | Boku no Hero Academia | completed | 0 | 13 | False | 2024-03-31T02:10:32+00:00 | 2024-03-30 |
3 | flerbz | 0 | 33486 | Boku no Hero Academia 2nd Season | completed | 0 | 25 | False | 2024-03-31T22:32:02+00:00 | 2024-03-30 |
4 | flerbz | 0 | 36456 | Boku no Hero Academia 3rd Season | watching | 0 | 24 | False | 2024-04-03T02:08:56+00:00 | 2024-03-31 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5452187 | mintcakee | 20010 | 392 | Yuu☆Yuu☆Hakusho | plan_to_watch | 0 | 0 | False | 2023-03-09T13:18:23+00:00 | NaN |
5452188 | mintcakee | 20010 | 1246 | Yuugo: Koushounin | plan_to_watch | 0 | 0 | False | 2023-10-23T14:14:44+00:00 | NaN |
5452189 | mintcakee | 20010 | 23283 | Zankyou no Terror | plan_to_watch | 0 | 0 | False | 2022-12-29T02:18:00+00:00 | NaN |
5452190 | mintcakee | 20010 | 37976 | Zombieland Saga | completed | 7 | 12 | False | 2023-04-24T14:35:42+00:00 | NaN |
5452191 | mintcakee | 20010 | 40174 | Zombieland Saga Revenge | completed | 8 | 12 | False | 2023-04-24T14:35:46+00:00 | NaN |
5452192 rows × 10 columns
Our final scraped user ratings dataset contains over 5.4 million entries. Even before excluding the users skipped due to 403 errors, this works out to roughly 270 titles per user on average! With this information we may be able to build some form of collaborative recommendation system, using these ratings to gauge how much a user may like a title they have not yet watched.
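As a rough illustration of that idea (not part of the scraping pipeline), the ratings can be pivoted into a user-item matrix, the usual starting point for collaborative filtering. The snippet below is only a sketch on a small subset of users, and assumes that a score of 0 means no explicit rating was given.
# Keep only explicit scores (a score of 0 appears to mean the user did not rate the title)
scored = user_ratings[user_ratings['Rating_Score'] > 0]

# Pivot a small subset of users into a user x title rating matrix to keep the dense table manageable
sample = scored[scored['User_Id'] < 1000]
user_item = sample.pivot_table(index='User_Id', columns='Anime_Id',
                               values='Rating_Score', aggfunc='mean')
print(user_item.shape)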
4. Conclusion
In this notebook we went through the process of scraping a dataset containing roughly 20,000 users and their ratings for the anime titles they have watched. Using this as raw data, further analysis and applications can be explored.