Contents

  1. Introduction
  2. Title Scraping
  3. Additional Information Scraping
  4. Conclusion

1. Introduction

In this notebook we will be scraping MyAnimeList, a popular anime database and community, aiming to collect enough raw data from the anime titles available on the website for further processing and learning purposes.

The relevant Python scripts and samples of the datasets can be found in the following repository.

import os
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
import random
import re
import csv

2. Title Scraping

We will start by scraping high-level information from anime titles that have been rated on the site. From the webpage we can guess what sort of information we can retrieve without accessing the detailed pages for specific titles. As seen in the image below, we will most likely be able to scrape the Rank, Title, Score, Type, Airing Period, and Members fields.

(Screenshot: topanime ranking page)

# Get request from site
site_url = 'https://myanimelist.net'
top_anime_url = site_url + '/topanime.php?limit='
response = requests.get(top_anime_url + '0')
response.status_code
200

Above is a quick sanity check to show that we are able to get a desired response from our GET request to the topanime webpage.

# Extract html information from the webpage
doc = BeautifulSoup(response.text, 'html.parser')
# Extract relevant portion of the webpage
row_contents = doc.find_all('tr', {'class':'ranking-list'})
row_contents[0]
<tr class="ranking-list">
<td class="rank ac" valign="top">
<span class="lightLink top-anime-rank-text rank1">1</span>
</td>
<td class="title al va-t word-break">
<a class="hoverinfo_trigger fl-l ml12 mr8" href="https://myanimelist.net/anime/52991/Sousou_no_Frieren" id="#area52991" rel="#info52991">
<img alt="Anime: Sousou no Frieren" border="0" class="lazyload" data-src="https://cdn.myanimelist.net/r/50x70/images/anime/1015/138006.jpg?s=09c2f2dec5891d8e8fbb9fa3b23c75b4" data-srcset="https://cdn.myanimelist.net/r/50x70/images/anime/1015/138006.jpg?s=09c2f2dec5891d8e8fbb9fa3b23c75b4 1x, https://cdn.myanimelist.net/r/100x140/images/anime/1015/138006.jpg?s=fdca2fe2777421f4c3aaa56a6ba8a46f 2x" height="70" width="50"/>
</a>
<div class="detail"><div id="area52991">
<div class="hoverinfo" id="info52991" rel="a52991"></div>
</div>
<div class="di-ib clearfix"><h3 class="fl-l fs14 fw-b anime_ranking_h3"><a class="hoverinfo_trigger" href="https://myanimelist.net/anime/52991/Sousou_no_Frieren" id="#area52991" rel="#info52991">Sousou no Frieren</a></h3><div class="icon-watch-pv2"><a class="mal-icon ml8 ga-click" href="https://myanimelist.net/anime/52991/Sousou_no_Frieren/video" title="Watch Promotional Video"><i class="malicon malicon-movie-pv"></i></a></div></div><br/><div class="information di-ib mt4">
        TV (28 eps)<br/>
        Sep 2023 - Mar 2024<br/>
        683,910 members
      </div></div>
</td>
<td class="score ac fs14"><div class="js-top-ranking-score-col di-ib al"><i class="icon-score-star fa-solid fa-star mr4 on"></i><span class="text on score-label score-9">9.39</span></div>
</td>
<td class="your-score ac fs14">
<div class="js-top-ranking-your-score-col di-ib al"> <a class="ga-impression" data-ga-click-type="data-ga-impression-type=" href="https://myanimelist.net/login.php?error=login_required&amp;from=%2Ftopanime.php%3Flimit%3D0" onclick="dataLayer.push({'event':'ga-js-event','ga-js-event-type':''})"><i class="icon-score-star fa-solid fa-star mr4"></i><span class="text score-label score-na">N/A</span></a>
</div>
</td>
<td class="status ac"> <a class="js-form-user-status js-form-user-status-btn Lightbox_AddEdit btn-addEdit-large btn-anime-watch-status js-anime-watch-status notinmylist ga-impression" data-ga-click-type="anime_ranking" data-ga-impression-type="anime_ranking" href="https://myanimelist.net/ownlist/anime/add?selected_series_id=52991&amp;hideLayout=1&amp;click_type=anime_ranking" onclick="dataLayer.push({'event':'ga-js-event','ga-js-event-type':'anime_ranking'})">Add to My List</a></td>
</tr>

Using the BeautifulSoup4 package we can easily parse the response HTML and identify the portion of the response that contains the relevant information. The HTML sample above corresponds to the highest-ranking title “Sousou no Frieren” on the website and contains the other information that we predicted at the start of this notebook.
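
As a quick illustration (not part of the scraping loop itself), the same selectors that the extraction function below relies on can be tried out on this single row:

# Sanity check on the first ranking row, using the selectors from the HTML above
first_row = row_contents[0]
print(first_row.find('td', class_="rank ac").find('span').text)        # 1
print(first_row.find('div', class_="di-ib clearfix").find('a').text)   # Sousou no Frieren
print(first_row.find('td', class_="score ac fs14").find('span').text)  # 9.39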

# Helper functions
### Implement randomized sleep time in between requests to reduce chance of being blocked from site
def sleep(t=3):
    rand_t = random.random() * (t) + 0.5
    time.sleep(rand_t)
    print(f"Sleeping for {rand_t}s")

### Clean up extracted text information
def parse_episodes(content):
    result = []
    for i in content:
        r = i.strip()
        result.append(r)
    return result

### Return only numeric characters from a string
def return_numeric(string):
    try:
        text = re.findall(r"\d+", string)[0]
    except IndexError:
        text = '?'
    return text
    
### Save our dictionary to a .csv file
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w', encoding='utf-8') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")).replace(',',' '))
            f.write(','.join(values) + "\n")          

Above are some self-explanatory helper functions that we will use when scraping the website.
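
For example, return_numeric and parse_episodes behave roughly as follows (illustrative inputs, not actual site output):

# Illustrative calls to the helpers above
return_numeric("TV (28 eps)")        # '28'
return_numeric("no digits here")     # '?'
parse_episodes(" TV (28 eps) \n Sep 2023 - Mar 2024 ".split('\n'))  # ['TV (28 eps)', 'Sep 2023 - Mar 2024']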

# Extract high level information from row_contents
def extract_info(top_anime, row_contents):
    stop = False
    for i in range(len(row_contents)):
        episode = parse_episodes(row_contents[i].find('div', class_ = "information di-ib mt4").text.strip().split('\n'))
        id_str = row_contents[i].find('td', class_='title al va-t word-break').find('a')['id']
        ranking = {
            'Id' : return_numeric(id_str),
            'Rank' : row_contents[i].find('td', class_ = "rank ac").find('span').text,
            'Title': row_contents[i].find('div', class_="di-ib clearfix").find('a').text,
            'Rating': row_contents[i].find('td', class_="score ac fs14").find('span').text,
            'Image_URL': row_contents[i].find('td', class_ ='title al va-t word-break').find('img')['data-src'],
            'Type' : episode[0].split('(')[0].strip(),
            'Episodes': return_numeric(episode[0].split('(')[1]),
            'Dates': episode[1],
            'Members': return_numeric(episode[2])
        }
        top_anime.append(ranking)
        if ranking['Rating']=='N/A':
            stop = True
    return top_anime, stop

The above function processes the response HTML, storing each title’s information in a dictionary and appending each dictionary to a list that is later written to disk.

Information is parsed mainly with BeautifulSoup; the helper functions defined earlier are used where the parsed data requires additional cleaning before being added to the dictionary.

An early stopping criterion is also built into this function. As the webpage’s title scores are sorted in descending order, it is reasonable to assume that titles missing a “Rating” score have not aired or are too obscure. The script stops scraping when it encounters the first title without a “Rating” score.

# Loop to scrape top anime pages, stop when non-rated title is found
def scrape_top_anime(file_name, t):
    top_anime = []
    stop = False
    counts = 0
    while not stop:
        sleep(t)
        response = requests.get(top_anime_url + str(counts))
        print(f"Current counts: {counts}, Request Status: {response.status_code}")
        while response.status_code != 200:
            sleep(t)
            response = requests.get(top_anime_url + str(counts))
        doc = BeautifulSoup(response.text, 'html.parser')
        row_contents = doc.find_all('tr', {'class':'ranking-list'})
        top_anime, stop = extract_info(top_anime, row_contents)
        counts += 50
    
    write_csv(top_anime, file_name)

This function contains the overall logic for scraping the topanime section of the site, using the functions defined earlier to extract the information before writing everything to disk.
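
A call along these lines (assuming a base sleep interval of around 3 seconds between requests) would produce the file read back in below:

# Kick off the scrape; the resulting .csv is loaded in the next cell
scrape_top_anime('top_anime_list.csv', 3)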

df = pd.read_csv('top_anime_list.csv')
df
Id Rank Title Rating Image_URL Type Episodes Dates Members
0 52991 1 Sousou no Frieren 9.39 https://cdn.myanimelist.net/r/50x70/images/ani... TV 28 Sep 2023 - Mar 2024 670
1 5114 2 Fullmetal Alchemist: Brotherhood 9.09 https://cdn.myanimelist.net/r/50x70/images/ani... TV 64 Apr 2009 - Jul 2010 3
2 9253 3 Steins;Gate 9.07 https://cdn.myanimelist.net/r/50x70/images/ani... TV 24 Apr 2011 - Sep 2011 2
3 28977 4 Gintama° 9.06 https://cdn.myanimelist.net/r/50x70/images/ani... TV 51 Apr 2015 - Mar 2016 628
4 38524 5 Shingeki no Kyojin Season 3 Part 2 9.05 https://cdn.myanimelist.net/r/50x70/images/ani... TV 10 Apr 2019 - Jul 2019 2
... ... ... ... ... ... ... ... ... ...
13295 49369 13296 Shinkai no Survival! NaN https://cdn.myanimelist.net/r/50x70/images/ani... Movie 1 Aug 2021 - Aug 2021 382
13296 57798 13297 Shinkalion: Change the World NaN https://cdn.myanimelist.net/r/50x70/images/ani... TV ? Apr 2024 - 2
13297 38114 13298 Shinkansen Henkei Robo Shinkalion The Animatio... NaN https://cdn.myanimelist.net/r/50x70/images/ani... ONA 1 Aug 2018 - Aug 2018 457
13298 22313 13299 Shinken Densetsu: Tight Road NaN https://cdn.myanimelist.net/r/50x70/images/ani... TV 13 Oct 1994 - Dec 1994 738
13299 34697 13300 Shinmai NaN https://cdn.myanimelist.net/r/50x70/images/ani... Special 4 Apr 2008 - Apr 2008 234

13300 rows × 9 columns

df['Rating'].isna().sum()
16

For reference, the final file containing the scraped data is shown in the dataframe above. A total of 13,300 titles were scraped, 16 of which are missing their “Rating” scores. These 16 titles were scraped anyway because the script extracts data in batches of 50 titles.

3. Additional Information Scraping

The dataset obtained in the previous section provides some high-level information across the many titles on the website. However, for further analysis and for building recommendation systems we are interested in obtaining more detailed information on each title.

To achieve that, we go beyond the topanime section of the site, into the webpage of each title and its subpages, to extract even more information.

In the screenshot below we can see an abundance of information on the title’s webpage. On the left there is a sidebar containing additional information and statistics for the title, and at the top there are links to subpages that contain even more. Our goal in this section is to identify and scrape the relevant data to flesh out our dataset, and we will do this for all 13,300 titles found in the previous section.

(Screenshot: Sousou no Frieren title page)

# Retrieve webpage url of specific pages for a title
def get_link_by_text(soup, anime_id, text):
    urls = list(filter(lambda x: str(anime_id) in x["href"], soup.find_all("a", text=text)))
    return urls[0]["href"]

The above function retrieves the URL of a specific subpage of the title, allowing us to send follow-up requests and extract data from these subpages.
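
For instance, assuming soup holds the parsed main page of a title (as built later in scrape_anime), a call might look like this:

# Hypothetical usage; returns the href of the anchor whose text matches, e.g. the Reviews subpage URL
link_review = get_link_by_text(soup, 52991, "Reviews")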

# Helper function to try get request; if fail 3 times log the title id in .csv file
def get_request(link, req_head, anime_id):
    for _ in range(3):
        try:
            data = requests.get(link, headers=req_head)
            if data.status_code !=200:
                sleep()
                continue
            else:
                return data
        except:
            buffer_t = random.random() * (40) + 100
            time.sleep(buffer_t)
            continue
    print(f"Error with Title Id {anime_id}")
    # Create the log file with headers if it does not exist yet
    if 'log_id.csv' not in os.listdir():
        with open('log_id.csv','w', encoding='utf-8') as f:
            writer = csv.writer(f, delimiter='|',lineterminator='\n')
            headers = ['MAL_Id', 'URL']
            writer.writerow(headers)
    with open('log_id.csv','a', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|',lineterminator='\n')
        writer.writerow([anime_id, link])

The function above contains the logic to handle exceptions or error codes from our GET requests. If we suspect rate limiting from the website’s end we pause our requests for roughly two minutes to avoid getting banned. Should no proper response be received after three attempts, the title id and subpage URL are logged to a .csv file for further investigation.
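
Note that get_request (and the functions below) rely on a req_head dictionary of request headers that is not shown here; a minimal sketch might be:

# Assumed request headers (not defined in the snippets above); a User-Agent helps avoid immediate rejections
req_head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}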

# Extract 1st page of reviews
def get_reviews(link, anime_id):
    sleep()
    review_link = f"{link}?p=1"
    #data = requests.get(review_link, header=req_head)
    data = get_request(review_link, req_head, anime_id)
    if data is None:
        return ['Error'],['Error']
    soup = BeautifulSoup(data.text, "html.parser")
    tags = soup.find_all("div", class_ = "tags")
    reviews = soup.find_all("div", class_="text")
    return tags, reviews

# Function to format reviews+tags
def get_review_tags(soup_tags, soup_reviews, anime_id):
    extra_tags = ['Funny','Informative','Well-written','Creative','Preliminary']
    review_tags = []
    output = []
    soup_reviews = [r.get_text() for r in soup_reviews]
    for soup_tag in soup_tags:
        curr_tags = []
        tags = soup_tag.text
        #tags = re.findall('[A-Z][^A-Z]*', tags)
        if 'Not' in tags:
            curr_tags.append("Not-Recommended")
        elif "Mixed" in tags:
            curr_tags.append("Mixed-Feelings")
        else:
            curr_tags.append("Recommended")
        for tag in extra_tags:
            if tag in tags:
                curr_tags.append(tag)
        review_tags.append(curr_tags)
    rt =  list(zip(soup_reviews, review_tags))    
    for row in rt:
        r, t = row
        output.append([anime_id, r, t])
    return output

# Helper function to write review/tags to csv file
def write_new_reviews(file_name, l):
    if not l:
        return
    if not file_name in os.listdir():
        with open(file_name,'w', encoding='utf-8') as f:
            writer = csv.writer(f, delimiter='|',lineterminator='\n')
            headers = ['MAL_Id','Review','Tags']
            writer.writerow(headers)
    with open(file_name,'a', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|',lineterminator='\n')
        for row in l:
            writer.writerow(row)

The “Reviews” subpage requires some formatting/cleaning before writing the data to disk. The above functions were made to handle this subpage.

The first function retrieves the first page of reviews for the title, returning lists of all the reviews found and their corresponding tags; titles with zero reviews are also handled here. The next two functions process the scraped data and write it to disk.

# Extract recommended anime title and number of recommendations
def get_recs(link_recommendations, anime_id):
    sleep()
    #data = requests.get(link_recomendations, header=req_head)
    data = get_request(link_recommendations, req_head, anime_id)
    if data is None:
        return ['Error'],['Error']
    soup = BeautifulSoup(data.text, "html.parser")
    soup.script.decompose()
    rec_ids = []
    rec_counts = []
    soup_ids = soup.find_all('div', {'class':'hoverinfo'})
    soup_rec_counts = soup.find_all('a', {'class':'js-similar-recommendations-button'})
    for i in range(len(soup_ids)):
        rec_id = return_numeric(soup_ids[i]['rel'])
        rec_ids.append(rec_id)
        if i < len(soup_rec_counts):
            rec_counts.append(soup_rec_counts[i].find('strong').text)
        else:
            rec_counts.append('1')
    return rec_ids, rec_counts

Next, we have a function to scrape the “Recommendations” subpage, identifying the recommended titles and the number of recommendations each has received. The case where no other titles are recommended is also handled here.
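
As an illustration (the subpage URL here is assumed; in the pipeline it is obtained via get_link_by_text), the return values are parallel lists of string ids and counts, matching the Recommended_Ids and Recommended_Counts columns in the final dataframe:

# Hypothetical call to get_recs for a single title
rec_ids, rec_counts = get_recs('https://myanimelist.net/anime/52991/Sousou_no_Frieren/userrecs', 52991)
# rec_ids    -> e.g. ['33352', '41025', ...]
# rec_counts -> e.g. ['14', '11', ...]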

# Extract title details and statistics
def scrape_anime_info(link_stats, anime_id, anime_info):
    # Get webpage
    #data = requests.get(link_stats, header=req_head)
    data = get_request(link_stats, req_head, anime_id)
    if data is None:
        return anime_info
    soup = BeautifulSoup(data.text, "html.parser")
    soup.script.decompose()
    
    # Scrape and store information in dict
    anime_info["MAL_Id"] = anime_id
    anime_info["Name"] = soup.find("h1", {"class": "title-name h1_bold_none"}).text.strip()

    score = soup.find("span", {"itemprop": "ratingValue"})
    if score is None:
        score = '?'
    try:
        anime_info['Score'] = score.text.strip()
    except:
        print('Empty Score')
        
    anime_info['Genres'] = [x.text.strip() for x in soup.findAll("span", {"itemprop": "genre"})]
    try:
        anime_info['Demographic'] = anime_info['Genres'][-1]
    except:
        print('Empty Genre')

    for s in soup.findAll("span", {"class": "dark_text"}):
        # Normalise non-breaking spaces and strip tabs from the parsed "Category: value" text
        info = [x.strip().replace("\xa0", " ") for x in s.parent.text.split(":")]
        cat, v = info[0], ":".join(info[1:])
        v = v.replace("\t", "")
        
        if cat in ['Synonyms','Japanese','English']:
            cat += '_Name'
            v = v.replace(',', '')
            anime_info[cat] = v
            continue
        if cat in ['Broadcast','Genres','Demographic','Score'] or cat not in anime_info.keys():
            continue
        elif cat in ['Producers','Licensors','Studios']:
            v = [x.strip() for x in v.split(",")]
        elif cat in ['Ranked','Popularity']:
            v = v.replace('#',"")
            v = v.replace(',', '')
        elif cat in ['Members','Favorites','Watching','Completed','On-Hold','Dropped','Plan to Watch','Total']:
            v = v.replace(',','')
            
        anime_info[cat] = v

    # Scrape scoring stats
    for s in soup.find("div", {"id": "horiznav_nav"}).parent.findAll(
        "div", {"class": "updatesBar"}):
        cat = f"Score-{s.parent.parent.parent.find('td', class_='score-label').text}"
        v = ([x.strip() for x in s.parent.text.split("%")][-1].strip("(votes)"))
        anime_info[cat] = str(v).strip()
    return anime_info

# Helper function to write dict to csv as a new row
def write_new_row(file_name, d):
    if not file_name in os.listdir():
        with open(file_name,'w', encoding='utf-8') as f:
            writer = csv.writer(f, delimiter='|',lineterminator='\n')
            headers = list(d.keys())
            writer.writerow(headers)
    with open(file_name,'a', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|',lineterminator='\n')
        values = []
        for k, v in d.items():
            values.append(str(v))
        writer.writerow(values)

# Scrape various information from the anime title through the links to its webpages
def scrape_anime(anime_id):
    #path = f"{HTML_PATH}/{anime_id}"
    #if f"{anime_id}.zip" in os.listdir(f'{HTML_PATH}'):
    #    return
    
    #os.makedirs(path, exist_ok=True)
    sleep()
    #data = requests.get(f"https://myanimelist.net/anime/{anime_id}", header=req_head)
    data = get_request(f"https://myanimelist.net/anime/{anime_id}", req_head, anime_id)
    if data is None:
        return
    
    soup = BeautifulSoup(data.text, "html.parser")
    soup.script.decompose()
    va = []
    for s in soup.find_all('td', class_='va-t ar pl4 pr4'):
        va.append(s.a.text)
    #save(f"{HTML_PATH}/{anime_id}/details.html", soup.prettify())
    
    # Get urls to detailed webpages
    link_review = get_link_by_text(soup, anime_id, "Reviews")
    link_recommendations = get_link_by_text(soup, anime_id, "Recommendations")
    link_stats = get_link_by_text(soup, anime_id, "Stats")
    #link_staff = get_link_by_text(soup, anime_id, "Characters & Staff")
    
    # Dict to store information
    key_list = ['MAL_Id','Name','Synonyms_Name','Japanese_Name','English_Name','Type','Episodes','Status','Aired','Premiered','Producers','Licensors','Studios','Source','Genres','Demographic','Duration','Rating','Score','Ranked','Popularity','Members','Favorites','Watching','Completed','On-Hold','Dropped','Plan to Watch','Total','Score-10','Score-9', 'Score-8', 'Score-7', 'Score-6', 'Score-5', 'Score-4','Score-3', 'Score-2', 'Score-1','Synopsis','Voice_Actors','Recommended_Ids','Recommended_Counts']
    anime_info = {key:'?' for key in key_list}
    
    # Scrape relevant information from the urls
    anime_info = scrape_anime_info(link_stats, anime_id, anime_info)
    anime_info['Synopsis'] = soup.find('p', {'itemprop':'description'}).text.replace('\r','').replace('\n','').replace('\t','')    
    anime_info['Voice_Actors'] = va
    rec_ids, rec_counts = get_recs(link_recommendations, anime_id)
    anime_info['Recommended_Ids'] = rec_ids
    anime_info['Recommended_Counts'] = rec_counts
    write_new_row('anime_info.csv', anime_info)
    
    soup_tags, soup_reviews = get_reviews(link_review, anime_id)
    if len(soup_tags) > 0 and len(soup_reviews) > 0:
        review_data = get_review_tags(soup_tags, soup_reviews, anime_id)
        write_new_reviews('anime_reviews.csv', review_data)
         
def scrape_all_anime_info(anime_list_file_name, i):
    df = pd.read_csv(anime_list_file_name)
    for aid in df.Id[i:]:
        scrape_anime(aid)
        i+=1
        print(f'Latest Title: {aid}, Title Completed: {i}/13300')
        if not i%20:
            print(time.asctime())

Finally, we have the functions that bring everything together. Using BeautifulSoup, the relevant information within the HTML tags is retrieved from the subpages and stored in a dictionary for each title. The dictionaries are then written to disk.
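
With everything defined, the full scrape can be started from the beginning of the title list; the second argument allows resuming from a given index if the run is interrupted:

# Scrape detailed info for every title id in the file produced earlier, starting from index 0
scrape_all_anime_info('top_anime_list.csv', 0)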

df1 = pd.read_csv('anime_info.csv', on_bad_lines='warn', delimiter='|')
df1
MAL_Id Name Synonyms_Name Japanese_Name English_Name Type Episodes Status Aired Premiered ... Score-6 Score-5 Score-4 Score-3 Score-2 Score-1 Synopsis Voice_Actors Recommended_Ids Recommended_Counts
0 52991 Sousou no Frieren Frieren at the Funeral 葬送のフリーレン Frieren:Beyond Journey's End TV 28 Finished Airing Sep 29, 2023 to Mar 22, 2024 Fall 2023 ... 3191 1726 734 426 402 4100 During their decade-long quest to defeat the D... ['Tanezaki, Atsumi', 'Ichinose, Kana', 'Kobaya... ['33352', '41025', '35851', '486', '457', '296... ['14', '11', '8', '5', '5', '4', '4', '3', '2'...
1 5114 Fullmetal Alchemist: Brotherhood Hagane no Renkinjutsushi:Fullmetal Alchemist F... 鋼の錬金術師 FULLMETAL ALCHEMIST Fullmetal Alchemist:Brotherhood TV 64 Finished Airing Apr 5, 2009 to Jul 4, 2010 Spring 2009 ... 31930 15538 5656 2763 3460 50602 After a horrific alchemy experiment goes wrong... ['Park, Romi', 'Kugimiya, Rie', 'Miki, Shinich... ['11061', '16498', '1482', '38000', '9919', '1... ['74', '44', '21', '17', '16', '14', '14', '9'...
2 9253 Steins;Gate ? STEINS;GATE Steins;Gate TV 24 Finished Airing Apr 6, 2011 to Sep 14, 2011 Spring 2011 ... 31520 16580 8023 3740 2868 10054 Eccentric scientist Rintarou Okabe has a never... ['Miyano, Mamoru', 'Imai, Asami', 'Hanazawa, K... ['31043', '31240', '9756', '10620', '2236', '4... ['132', '130', '48', '26', '24', '19', '19', '...
3 28977 Gintama° Gintama' (2015) 銀魂° Gintama Season 4 TV 51 Finished Airing Apr 8, 2015 to Mar 30, 2016 Spring 2015 ... 6060 3601 1496 1011 1477 8616 Gintoki, Shinpachi, and Kagura return as the f... ['Sugita, Tomokazu', 'Kugimiya, Rie', 'Sakaguc... ['9863', '30276', '33255', '37105', '6347', '3... ['3', '2', '1', '1', '1', '1', '1', '1', '1', ...
4 38524 Shingeki no Kyojin Season 3 Part 2 ? 進撃の巨人 Season3 Part.2 Attack on Titan Season 3 Part 2 TV 10 Finished Airing Apr 29, 2019 to Jul 1, 2019 Spring 2019 ... 22287 8112 3186 1596 1308 12803 Seeking to restore humanity's diminishing hope... ['Kamiya, Hiroshi', 'Kaji, Yuuki', 'Ishikawa, ... ['28623', '37521', '25781', '2904', '36649', '... ['1', '1', '1', '1', '1', '1', '1', '1', '1', ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13295 49369 Shinkai no Survival! Kagaku Manga Survival 深海のサバイバル! ? Movie 1 Finished Airing Aug 13, 2021 ? ... 7 5 2 3 2 7 Second movie of Kagaku Manga Surivival learnin... [] [] []
13296 57798 Shinkalion: Change the World ? シンカリオン チェンジ ザ ワールド ? TV Unknown Currently Airing Apr 7, 2024 to ? Spring 2024 ... 31 34 15 7 3 5 Once upon a time, an unidentified enemy, the U... ['Tsuchiya, Shinba', 'Ono, Kensho', 'Ishibashi... [] []
13297 38114 Shinkansen Henkei Robo Shinkalion The Animatio... Shinkansen Henkei Robo Shinkalion:Soushuuhen -... 【新幹線変形ロボ シンカリオン】総集編『団らん!!速杉家とシンカリオン』 Bullet Train Transforming Robot Shinkalion The... ONA 1 Finished Airing Aug 2, 2018 ? ... 16 24 6 3 3 20 No synopsis information has been added to this... [] [] []
13298 22313 Shinken Densetsu: Tight Road True Fist Legend 真拳伝説 タイトロード ? TV 13 Finished Airing Oct 7, 1994 to Dec 28, 1994 Fall 1994 ... 12 14 8 3 2 9 No synopsis information has been added to this... ['Kamiyama, Masami'] [] []
13299 34697 Shinmai ? 新米 ? Special 4 Finished Airing Apr 15, 2008 ? ... 2 9 6 11 7 6 Everyone is brand new (shinmai) at the beginni... [] [] []

13300 rows × 43 columns

The above dataframe shows the final scraped dataset for all 13300 titles, containing significantly more information than what we had in the previous section.

df2 = pd.read_csv('anime_reviews.csv', delimiter='|', on_bad_lines='warn')
df2
MAL_Id Review Tags
0 52991 \r\n With lives so short, why d... ['Recommended', 'Preliminary']
1 52991 \r\n Frieren is the most overra... ['Not-Recommended', 'Funny', 'Preliminary']
2 52991 \r\n I feel so catered to.\r\n\... ['Recommended']
3 52991 \r\n Style-\r\r\nFrieren doesn'... ['Not-Recommended', 'Funny']
4 52991 \r\n Through 3 episodes, Friere... ['Mixed-Feelings', 'Preliminary']
... ... ... ...
77912 3287 \r\n Anime is and always has be... ['Not-Recommended']
77913 3287 \r\n If you've come to watch a ... ['Not-Recommended']
77914 3287 \r\n Giant Sqid Thingy is muh w... ['Recommended']
77915 3287 \r\n "It is not the fault of th... ['Recommended']
77916 3287 \r\n "Tenkuu Danzai Skelter+Hea... ['Recommended']

77917 rows × 3 columns

The above dataframe shows the scraped review data with the corresponding title ids, opening up the possibility of using it for a text-based collaborative recommendation system or for sentiment analysis in general.
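
Before any such analysis, the review text would likely need some light cleanup, for example collapsing the carriage returns and newlines visible above:

# Collapse runs of whitespace (\r\n padding) in the raw review text
df2['Review'] = df2['Review'].str.replace(r'\s+', ' ', regex=True).str.strip()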

4. Conclusion

In this notebook we have gone through the process of scraping a sizeable dataset for future use. In the next part we will scrape user data from the same site using a different strategy.