Contents

  1. Introduction
  2. Title Scraping
  3. Additional Information Scraping
  4. Conclusion

1. Introduction

In this notebook we will be scraping MyAnimeList, a popular anime database and community, aiming to collect enough raw data from the anime titles available on the website for further processing and learning purposes.

The relevant Python scripts and samples of the datasets can be found in the following repository.

import os
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
import random
import re
import csv

2. Title Scraping

We will start by scraping high-level information from anime titles that have been rated on the site. From the webpage we can guess what sort of information we can retrieve without accessing the detailed pages for specific titles. As seen in the image below, we will most likely be able to scrape the Rank, Title, Score, Type, Airing Period, and Members fields.

(Screenshot: topanime ranking page)

# Get request from site
site_url = 'https://myanimelist.net'
top_anime_url = site_url + '/topanime.php?limit='
response = requests.get(top_anime_url + '0')
response.status_code
200

Above is a quick sanity check to show that we are able to get a desired response from our GET request to the topanime webpage.

# Extract html information from the webpage
doc = BeautifulSoup(response.text, 'html.parser')
# Extract relevant portion of the webpage
row_contents = doc.find_all('tr', {'class':'ranking-list'})
row_contents[0]
<tr class="ranking-list">
<td class="rank ac" valign="top">
<span class="lightLink top-anime-rank-text rank1">1</span>
</td>
<td class="title al va-t word-break">
<a class="hoverinfo_trigger fl-l ml12 mr8" href="https://myanimelist.net/anime/52991/Sousou_no_Frieren" id="#area52991" rel="#info52991">
<img alt="Anime: Sousou no Frieren" border="0" class="lazyload" data-src="https://cdn.myanimelist.net/r/50x70/images/anime/1015/138006.jpg?s=09c2f2dec5891d8e8fbb9fa3b23c75b4" data-srcset="https://cdn.myanimelist.net/r/50x70/images/anime/1015/138006.jpg?s=09c2f2dec5891d8e8fbb9fa3b23c75b4 1x, https://cdn.myanimelist.net/r/100x140/images/anime/1015/138006.jpg?s=fdca2fe2777421f4c3aaa56a6ba8a46f 2x" height="70" width="50"/>
</a>
<div class="detail"><div id="area52991">
<div class="hoverinfo" id="info52991" rel="a52991"></div>
</div>
<div class="di-ib clearfix"><h3 class="fl-l fs14 fw-b anime_ranking_h3"><a class="hoverinfo_trigger" href="https://myanimelist.net/anime/52991/Sousou_no_Frieren" id="#area52991" rel="#info52991">Sousou no Frieren</a></h3><div class="icon-watch-pv2"><a class="mal-icon ml8 ga-click" href="https://myanimelist.net/anime/52991/Sousou_no_Frieren/video" title="Watch Promotional Video"><i class="malicon malicon-movie-pv"></i></a></div></div><br/><div class="information di-ib mt4">
        TV (28 eps)<br/>
        Sep 2023 - Mar 2024<br/>
        683,910 members
      </div></div>
</td>
<td class="score ac fs14"><div class="js-top-ranking-score-col di-ib al"><i class="icon-score-star fa-solid fa-star mr4 on"></i><span class="text on score-label score-9">9.39</span></div>
</td>
<td class="your-score ac fs14">
<div class="js-top-ranking-your-score-col di-ib al"> <a class="ga-impression" data-ga-click-type="data-ga-impression-type=" href="https://myanimelist.net/login.php?error=login_required&amp;from=%2Ftopanime.php%3Flimit%3D0" onclick="dataLayer.push({'event':'ga-js-event','ga-js-event-type':''})"><i class="icon-score-star fa-solid fa-star mr4"></i><span class="text score-label score-na">N/A</span></a>
</div>
</td>
<td class="status ac"> <a class="js-form-user-status js-form-user-status-btn Lightbox_AddEdit btn-addEdit-large btn-anime-watch-status js-anime-watch-status notinmylist ga-impression" data-ga-click-type="anime_ranking" data-ga-impression-type="anime_ranking" href="https://myanimelist.net/ownlist/anime/add?selected_series_id=52991&amp;hideLayout=1&amp;click_type=anime_ranking" onclick="dataLayer.push({'event':'ga-js-event','ga-js-event-type':'anime_ranking'})">Add to My List</a></td>
</tr>

Using the BeautifulSoup4 package we can easily parse the response HTML and identify the portion of the response that contains the relevant information. The HTML sample above corresponds to the highest-ranking title “Sousou no Frieren” on the website and contains the other information that we predicted at the start of this notebook.
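
As a quick illustration (not part of the scraping loop itself), the same selectors that the extraction function below relies on can be tried out on this single row:

# Sanity check on the first ranking row, using the selectors from the HTML above
first_row = row_contents[0]
print(first_row.find('td', class_="rank ac").find('span').text)        # 1
print(first_row.find('div', class_="di-ib clearfix").find('a').text)   # Sousou no Frieren
print(first_row.find('td', class_="score ac fs14").find('span').text)  # 9.39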

# Helper functions
### Implement randomized sleep time in between requests to reduce chance of being blocked from site
def sleep(t=3):
    rand_t = random.random() * (t) + 0.5
    time.sleep(rand_t)
    print(f"Sleeping for {rand_t}s")

### Clean up extracted text information
def parse_episodes(content):
    result = []
    for i in content:
        r = i.strip()
        result.append(r)
    return result

### Return only numeric characters from a string
def return_numeric(string):
    try:
        text = re.findall(r"\d+", string)[0]
    except IndexError:
        text = '?'
    return text
    
### Save our dictionary to a .csv file
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w', encoding='utf-8') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")).replace(',',' '))
            f.write(','.join(values) + "\n")          

Above are some self-explanatory helper functions that we will use when scraping the website.
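
For example, return_numeric and parse_episodes behave roughly as follows (illustrative inputs, not actual site output):

# Illustrative calls to the helpers above
return_numeric("TV (28 eps)")        # '28'
return_numeric("no digits here")     # '?'
parse_episodes(" TV (28 eps) \n Sep 2023 - Mar 2024 ".split('\n'))  # ['TV (28 eps)', 'Sep 2023 - Mar 2024']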

# Extract high level information from row_contents
def extract_info(top_anime, row_contents):
    stop = False
    for i in range(len(row_contents)):
        episode = parse_episodes(row_contents[i].find('div', class_ = "information di-ib mt4").text.strip().split('\n'))
        id_str = row_contents[i].find('td', class_='title al va-t word-break').find('a')['id']
        ranking = {
            'Id' : return_numeric(id_str),
            'Rank' : row_contents[i].find('td', class_ = "rank ac").find('span').text,
            'Title': row_contents[i].find('div', class_="di-ib clearfix").find('a').text,
            'Rating': row_contents[i].find('td', class_="score ac fs14").find('span').text,
            'Image_URL': row_contents[i].find('td', class_ ='title al va-t word-break').find('img')['data-src'],
            'Type' : episode[0].split('(')[0].strip(),
            'Episodes': return_numeric(episode[0].split('(')[1]),
            'Dates': episode[1],
            'Members': return_numeric(episode[2])
        }
        top_anime.append(ranking)
        if ranking['Rating']=='N/A':
            stop = True
    return top_anime, stop

The above function processes the response HTML, storing each title’s information in a dictionary and appending each dictionary to a list that is later written to disk.

Information is parsed mainly with BeautifulSoup; the helper functions defined earlier are used where the parsed data requires additional cleaning before being added to the dictionary.

An early stopping criterion is also built into this function. As the webpage’s title scores are sorted in descending order, it is reasonable to assume that titles missing a “Rating” score have not aired or are too obscure. The script stops scraping when it encounters the first title without a “Rating” score.

# Loop to scrape top anime pages, stop when non-rated title is found
def scrape_top_anime(file_name, t):
    top_anime = []
    stop = False
    counts = 0
    while not stop:
        sleep(t)
        response = requests.get(top_anime_url + str(counts))
        print(f"Current counts: {counts}, Request Status: {response.status_code}")
        while response.status_code != 200:
            sleep(t)
            response = requests.get(top_anime_url + str(counts))
        doc = BeautifulSoup(response.text, 'html.parser')
        row_contents = doc.find_all('tr', {'class':'ranking-list'})
        top_anime, stop = extract_info(top_anime, row_contents)
        counts += 50
    
    write_csv(top_anime, file_name)

This function contains the overall logic for scraping the topanime section of the site, using the functions defined earlier to extract the information before writing everything to disk.
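
A call along these lines (assuming a base sleep interval of around 3 seconds between requests) would produce the file read back in below:

# Kick off the scrape; the resulting .csv is loaded in the next cell
scrape_top_anime('top_anime_list.csv', 3)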

df = pd.read_csv('top_anime_list.csv')
df
Id Rank Title Rating Image_URL Type Episodes Dates Members
0 52991 1 Sousou no Frieren 9.39 https://cdn.myanimelist.net/r/50x70/images/ani... TV 28 Sep 2023 - Mar 2024 670
1 5114 2 Fullmetal Alchemist: Brotherhood 9.09 https://cdn.myanimelist.net/r/50x70/images/ani... TV 64 Apr 2009 - Jul 2010 3
2 9253 3 Steins;Gate 9.07 https://cdn.myanimelist.net/r/50x70/images/ani... TV 24 Apr 2011 - Sep 2011 2
3 28977 4 Gintama° 9.06 https://cdn.myanimelist.net/r/50x70/images/ani... TV 51 Apr 2015 - Mar 2016 628
4 38524 5 Shingeki no Kyojin Season 3 Part 2 9.05 https://cdn.myanimelist.net/r/50x70/images/ani... TV 10 Apr 2019 - Jul 2019 2
... ... ... ... ... ... ... ... ... ...
13295 49369 13296 Shinkai no Survival! NaN https://cdn.myanimelist.net/r/50x70/images/ani... Movie 1 Aug 2021 - Aug 2021 382
13296 57798 13297 Shinkalion: Change the World NaN https://cdn.myanimelist.net/r/50x70/images/ani... TV ? Apr 2024 - 2
13297 38114 13298 Shinkansen Henkei Robo Shinkalion The Animatio... NaN https://cdn.myanimelist.net/r/50x70/images/ani... ONA 1 Aug 2018 - Aug 2018 457
13298 22313 13299 Shinken Densetsu: Tight Road NaN https://cdn.myanimelist.net/r/50x70/images/ani... TV 13 Oct 1994 - Dec 1994 738
13299 34697 13300 Shinmai NaN https://cdn.myanimelist.net/r/50x70/images/ani... Special 4 Apr 2008 - Apr 2008 234

13300 rows × 9 columns

df['Rating'].isna().sum()
16

For reference, the final file containing the scraped data is shown in the dataframe above. A total of 13,300 titles were scraped, 16 of which are missing their “Rating” scores. These 16 titles were scraped anyway because the script extracts data in batches of 50 titles.

3. Additional Information Scraping

The dataset obtained in the previous section provides some high-level information across the many titles on the website. However, for further analysis and for building recommendation systems we are interested in obtaining more detailed information on each title.

To achieve that, we go beyond the topanime section of the site, into the webpage of each title and its subpages, to extract even more information.

In the screenshot below we can see an abundance of information on the title’s webpage. On the left there is a sidebar containing additional information and statistics for the title, and at the top there are links to subpages that contain even more. Our goal in this section is to identify and scrape the relevant data to flesh out our dataset, and we will do this for all 13,300 titles found in the previous section.

(Screenshot: Sousou no Frieren title page)

# Retrieve webpage url of specific pages for a title
def get_link_by_text(soup, anime_id, text):
    urls = list(filter(lambda x: str(anime_id) in x["href"], soup.find_all("a", text=text)))
    return urls[0]["href"]

The above function retrieves the URL of a specific subpage of the title, allowing us to send follow-up requests and extract data from these subpages.
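
For instance, assuming soup holds the parsed main page of a title (as built later in scrape_anime), a call might look like this:

# Hypothetical usage; returns the href of the anchor whose text matches, e.g. the Reviews subpage URL
link_review = get_link_by_text(soup, 52991, "Reviews")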

# Helper function to try get request; if fail 3 times log the title id in .csv file
def get_request(link, req_head, anime_id):
    for _ in range(3):
        try:
            data = requests.get(link, headers=req_head)
            if data.status_code !=200:
                sleep()
                continue
            else:
                return data
        except:
            buffer_t = random.random() * (40) + 100
            time.sleep(buffer_t)
            continue
    print(f"Error with Title Id {anime_id}")
    # Create the log file with headers if it does not exist yet
    if 'log_id.csv' not in os.listdir():
        with open('log_id.csv','w', encoding='utf-8') as f:
            writer = csv.writer(f, delimiter='|',lineterminator='\n')
            headers = ['MAL_Id', 'URL']
            writer.writerow(headers)
    with open('log_id.csv','a', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|',lineterminator='\n')
        writer.writerow([anime_id, link])

The function above contains the logic to handle exceptions or error codes from our GET requests. If we suspect rate limiting from the website’s end we pause our requests for roughly two minutes to avoid getting banned. Should no proper response be received after three attempts, the title id and subpage URL are logged to a .csv file for further investigation.
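
Note that get_request (and the functions below) rely on a req_head dictionary of request headers that is not shown here; a minimal sketch might be:

# Assumed request headers (not defined in the snippets above); a User-Agent helps avoid immediate rejections
req_head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}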

# Extract 1st page of reviews
def get_reviews(link, anime_id):
    sleep()
    review_link = f"{link}?p=1"
    #data = requests.get(review_link, header=req_head)
    data = get_request(review_link, req_head, anime_id)
    if data is None:
        return ['Error'],['Error']
    soup = BeautifulSoup(data.text, "html.parser")
    tags = soup.find_all("div", class_ = "tags")
    reviews = soup.find_all("div", class_="text")
    return tags, reviews

# Function to format reviews+tags
def get_review_tags(soup_tags, soup_reviews, anime_id):
    extra_tags = ['Funny','Informative','Well-written','Creative','Preliminary']
    review_tags = []
    output = []
    soup_reviews = [r.get_text() for r in soup_reviews]
    for soup_tag in soup_tags:
        curr_tags = []
        tags = soup_tag.text
        #tags = re.findall('[A-Z][^A-Z]*', tags)
        if 'Not' in tags:
            curr_tags.append("Not-Recommended")
        elif "Mixed" in tags:
            curr_tags.append("Mixed-Feelings")
        else:
            curr_tags.append("Recommended")
        for tag in extra_tags:
            if tag in tags:
                curr_tags.append(tag)
        review_tags.append(curr_tags)
    rt =  list(zip(soup_reviews, review_tags))    
    for row in rt:
        r, t = row
        output.append([anime_id, r, t])
    return output

# Helper function to write review/tags to csv file
def write_new_reviews(file_name, l):
    if not l:
        return
    if not file_name in os.listdir():
        with open(file_name,'w', encoding='utf-8') as f:
            writer = csv.writer(f, delimiter='|',lineterminator='\n')
            headers = ['MAL_Id','Review','Tags']
            writer.writerow(headers)
    with open(file_name,'a', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|',lineterminator='\n')
        for row in l:
            writer.writerow(row)

The “Reviews” subpage requires some formatting/cleaning before writing the data to disk. The above functions were made to handle this subpage.

The first function retrieves the first page of reviews for the title, returning lists of all the reviews found and their corresponding tags; titles with zero reviews are also handled here. The next two functions process the scraped data and write it to disk.

# Extract recommended anime title and number of recommendations
def get_recs(link_recommendations, anime_id):
    sleep()
    #data = requests.get(link_recomendations, header=req_head)
    data = get_request(link_recommendations, req_head, anime_id)
    if data is None:
        return ['Error'],['Error']
    soup = BeautifulSoup(data.text, "html.parser")
    soup.script.decompose()
    rec_ids = []
    rec_counts = []
    soup_ids = soup.find_all('div', {'class':'hoverinfo'})
    soup_rec_counts = soup.find_all('a', {'class':'js-similar-recommendations-button'})
    for i in range(len(soup_ids)):
        rec_id = return_numeric(soup_ids[i]['rel'])
        rec_ids.append(rec_id)
        if i < len(soup_rec_counts):
            rec_counts.append(soup_rec_counts[i].find('strong').text)
        else:
            rec_counts.append('1')
    return rec_ids, rec_counts

Next, we have a function to scrape the “Recommendations” subpage, identifying the recommended titles and the number of recommendations each has received. The case where no other titles are recommended is also handled here.
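
As an illustration (the subpage URL here is assumed; in the pipeline it is obtained via get_link_by_text), the return values are parallel lists of string ids and counts, matching the Recommended_Ids and Recommended_Counts columns in the final dataframe:

# Hypothetical call to get_recs for a single title
rec_ids, rec_counts = get_recs('https://myanimelist.net/anime/52991/Sousou_no_Frieren/userrecs', 52991)
# rec_ids    -> e.g. ['33352', '41025', ...]
# rec_counts -> e.g. ['14', '11', ...]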

# Extract title details and statistics
def scrape_anime_info(link_stats, anime_id, anime_info):
    # Get webpage
    #data = requests.get(link_stats, header=req_head)
    data = get_request(link_stats, req_head, anime_id)
    if data is None:
        return anime_info
    soup = BeautifulSoup(data.text, "html.parser")
    soup.script.decompose()
    
    # Scrape and store information in dict
    anime_info["MAL_Id"] = anime_id
    anime_info["Name"] = soup.find("h1", {"class": "title-name h1_bold_none"}).text.strip()

    score = soup.find("span", {"itemprop": "ratingValue"})
    if score is None:
        score = '?'
    try:
        anime_info['Score'] = score.text.strip()
    except:
        print('Empty Score')
        
    anime_info['Genres'] = [x.text.strip() for x in soup.findAll("span", {"itemprop": "genre"})]
    try:
        anime_info['Demographic'] = anime_info['Genres'][-1]
    except:
        print('Empty Genre')

    for s in soup.findAll("span", {"class": "dark_text"}):
        # Normalise non-breaking spaces and strip tabs from the parsed "Category: value" text
        info = [x.strip().replace("\xa0", " ") for x in s.parent.text.split(":")]
        cat, v = info[0], ":".join(info[1:])
        v = v.replace("\t", "")
        
        if cat in ['Synonyms','Japanese','English']:
            cat += '_Name'
            v = v.replace(',', '')
            anime_info[cat] = v
            continue
        if cat in ['Broadcast','Genres','Demographic','Score'] or cat not in anime_info.keys():
            continue
        elif cat in ['Producers','Licensors','Studios']:
            v = [x.strip() for x in v.split(",")]
        elif cat in ['Ranked','Popularity']:
            v = v.replace('#',"")
            v = v.replace(',', '')
        elif cat in ['Members','Favorites','Watching','Completed','On-Hold','Dropped','Plan to Watch','Total']:
            v = v.replace(',','')
            
        anime_info[cat] = v

    # Scrape scoring stats
    for s in soup.find("div", {"id": "horiznav_nav"}).parent.findAll(
        "div", {"class": "updatesBar"}):
        cat = f"Score-{s.parent.parent.parent.find('td', class_='score-label').text}"
        v = ([x.strip() for x in s.parent.text.split("%")][-1].strip("(votes)"))
        anime_info[cat] = str(v).strip()
    return anime_info

# Helper function to write dict to csv as a new row
def write_new_row(file_name, d):
    if not file_name in os.listdir():
        with open(file_name,'w', encoding='utf-8') as f:
            writer = csv.writer(f, delimiter='|',lineterminator='\n')
            headers = list(d.keys())
            writer.writerow(headers)
    with open(file_name,'a', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|',lineterminator='\n')
        values = []
        for k, v in d.items():
            values.append(str(v))
        writer.writerow(values)

# Scrape various information from the anime title through the links to its webpages
def scrape_anime(anime_id):
    #path = f"{HTML_PATH}/{anime_id}"
    #if f"{anime_id}.zip" in os.listdir(f'{HTML_PATH}'):
    #    return
    
    #os.makedirs(path, exist_ok=True)
    sleep()
    #data = requests.get(f"https://myanimelist.net/anime/{anime_id}", header=req_head)
    data = get_request(f"https://myanimelist.net/anime/{anime_id}", req_head, anime_id)
    if data is None:
        return
    
    soup = BeautifulSoup(data.text, "html.parser")
    soup.script.decompose()
    va = []
    for s in soup.find_all('td', class_='va-t ar pl4 pr4'):
        va.append(s.a.text)
    #save(f"{HTML_PATH}/{anime_id}/details.html", soup.prettify())
    
    # Get urls to detailed webpages
    link_review = get_link_by_text(soup, anime_id, "Reviews")
    link_recommendations = get_link_by_text(soup, anime_id, "Recommendations")
    link_stats = get_link_by_text(soup, anime_id, "Stats")
    #link_staff = get_link_by_text(soup, anime_id, "Characters & Staff")
    
    # Dict to store information
    key_list = ['MAL_Id','Name','Synonyms_Name','Japanese_Name','English_Name','Type','Episodes','Status','Aired','Premiered','Producers','Licensors','Studios','Source','Genres','Demographic','Duration','Rating','Score','Ranked','Popularity','Members','Favorites','Watching','Completed','On-Hold','Dropped','Plan to Watch','Total','Score-10','Score-9', 'Score-8', 'Score-7', 'Score-6', 'Score-5', 'Score-4','Score-3', 'Score-2', 'Score-1','Synopsis','Voice_Actors','Recommended_Ids','Recommended_Counts']
    anime_info = {key:'?' for key in key_list}
    
    # Scrape relevant information from the urls
    anime_info = scrape_anime_info(link_stats, anime_id, anime_info)
    anime_info['Synopsis'] = soup.find('p', {'itemprop':'description'}).text.replace('\r','').replace('\n','').replace('\t','')    
    anime_info['Voice_Actors'] = va
    rec_ids, rec_counts = get_recs(link_recommendations, anime_id)
    anime_info['Recommended_Ids'] = rec_ids
    anime_info['Recommended_Counts'] = rec_counts
    write_new_row('anime_info.csv', anime_info)
    
    soup_tags, soup_reviews = get_reviews(link_review, anime_id)
    if len(soup_tags) > 0 and len(soup_reviews) > 0:
        review_data = get_review_tags(soup_tags, soup_reviews, anime_id)
        write_new_reviews('anime_reviews.csv', review_data)
         
def scrape_all_anime_info(anime_list_file_name, i):
    df = pd.read_csv(anime_list_file_name)
    for aid in df.Id[i:]:
        scrape_anime(aid)
        i+=1
        print(f'Latest Title: {aid}, Title Completed: {i}/13300')
        if not i%20:
            print(time.asctime())

Finally, we have the functions that bring everything together. Using BeautifulSoup, the relevant information within the HTML tags is retrieved from the subpages and stored in a dictionary for each title. The dictionaries are then written to disk.
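
With everything defined, the full scrape can be started from the beginning of the title list; the second argument allows resuming from a given index if the run is interrupted:

# Scrape detailed info for every title id in the file produced earlier, starting from index 0
scrape_all_anime_info('top_anime_list.csv', 0)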

df1 = pd.read_csv('anime_info.csv', on_bad_lines='warn', delimiter='|')
df1
MAL_Id Name Synonyms_Name Japanese_Name English_Name Type Episodes Status Aired Premiered ... Score-6 Score-5 Score-4 Score-3 Score-2 Score-1 Synopsis Voice_Actors Recommended_Ids Recommended_Counts
0 52991 Sousou no Frieren Frieren at the Funeral 葬送のフリーレン Frieren:Beyond Journey's End TV 28 Finished Airing Sep 29, 2023 to Mar 22, 2024 Fall 2023 ... 3191 1726 734 426 402 4100 During their decade-long quest to defeat the D... ['Tanezaki, Atsumi', 'Ichinose, Kana', 'Kobaya... ['33352', '41025', '35851', '486', '457', '296... ['14', '11', '8', '5', '5', '4', '4', '3', '2'...
1 5114 Fullmetal Alchemist: Brotherhood Hagane no Renkinjutsushi:Fullmetal Alchemist F... 鋼の錬金術師 FULLMETAL ALCHEMIST Fullmetal Alchemist:Brotherhood TV 64 Finished Airing Apr 5, 2009 to Jul 4, 2010 Spring 2009 ... 31930 15538 5656 2763 3460 50602 After a horrific alchemy experiment goes wrong... ['Park, Romi', 'Kugimiya, Rie', 'Miki, Shinich... ['11061', '16498', '1482', '38000', '9919', '1... ['74', '44', '21', '17', '16', '14', '14', '9'...
2 9253 Steins;Gate ? STEINS;GATE Steins;Gate TV 24 Finished Airing Apr 6, 2011 to Sep 14, 2011 Spring 2011 ... 31520 16580 8023 3740 2868 10054 Eccentric scientist Rintarou Okabe has a never... ['Miyano, Mamoru', 'Imai, Asami', 'Hanazawa, K... ['31043', '31240', '9756', '10620', '2236', '4... ['132', '130', '48', '26', '24', '19', '19', '...
3 28977 Gintama° Gintama' (2015) 銀魂° Gintama Season 4 TV 51 Finished Airing Apr 8, 2015 to Mar 30, 2016 Spring 2015 ... 6060 3601 1496 1011 1477 8616 Gintoki, Shinpachi, and Kagura return as the f... ['Sugita, Tomokazu', 'Kugimiya, Rie', 'Sakaguc... ['9863', '30276', '33255', '37105', '6347', '3... ['3', '2', '1', '1', '1', '1', '1', '1', '1', ...
4 38524 Shingeki no Kyojin Season 3 Part 2 ? 進撃の巨人 Season3 Part.2 Attack on Titan Season 3 Part 2 TV 10 Finished Airing Apr 29, 2019 to Jul 1, 2019 Spring 2019 ... 22287 8112 3186 1596 1308 12803 Seeking to restore humanity's diminishing hope... ['Kamiya, Hiroshi', 'Kaji, Yuuki', 'Ishikawa, ... ['28623', '37521', '25781', '2904', '36649', '... ['1', '1', '1', '1', '1', '1', '1', '1', '1', ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13295 49369 Shinkai no Survival! Kagaku Manga Survival 深海のサバイバル! ? Movie 1 Finished Airing Aug 13, 2021 ? ... 7 5 2 3 2 7 Second movie of Kagaku Manga Surivival learnin... [] [] []
13296 57798 Shinkalion: Change the World ? シンカリオン チェンジ ザ ワールド ? TV Unknown Currently Airing Apr 7, 2024 to ? Spring 2024 ... 31 34 15 7 3 5 Once upon a time, an unidentified enemy, the U... ['Tsuchiya, Shinba', 'Ono, Kensho', 'Ishibashi... [] []
13297 38114 Shinkansen Henkei Robo Shinkalion The Animatio... Shinkansen Henkei Robo Shinkalion:Soushuuhen -... 【新幹線変形ロボ シンカリオン】総集編『団らん!!速杉家とシンカリオン』 Bullet Train Transforming Robot Shinkalion The... ONA 1 Finished Airing Aug 2, 2018 ? ... 16 24 6 3 3 20 No synopsis information has been added to this... [] [] []
13298 22313 Shinken Densetsu: Tight Road True Fist Legend 真拳伝説 タイトロード ? TV 13 Finished Airing Oct 7, 1994 to Dec 28, 1994 Fall 1994 ... 12 14 8 3 2 9 No synopsis information has been added to this... ['Kamiyama, Masami'] [] []
13299 34697 Shinmai ? 新米 ? Special 4 Finished Airing Apr 15, 2008 ? ... 2 9 6 11 7 6 Everyone is brand new (shinmai) at the beginni... [] [] []

13300 rows × 43 columns

The above dataframe shows the final scraped dataset for all 13300 titles, containing significantly more information than what we had in the previous section.

df2 = pd.read_csv('anime_reviews.csv', delimiter='|', on_bad_lines='warn')
df2
MAL_Id Review Tags
0 52991 \r\n With lives so short, why d... ['Recommended', 'Preliminary']
1 52991 \r\n Frieren is the most overra... ['Not-Recommended', 'Funny', 'Preliminary']
2 52991 \r\n I feel so catered to.\r\n\... ['Recommended']
3 52991 \r\n Style-\r\r\nFrieren doesn'... ['Not-Recommended', 'Funny']
4 52991 \r\n Through 3 episodes, Friere... ['Mixed-Feelings', 'Preliminary']
... ... ... ...
77912 3287 \r\n Anime is and always has be... ['Not-Recommended']
77913 3287 \r\n If you've come to watch a ... ['Not-Recommended']
77914 3287 \r\n Giant Sqid Thingy is muh w... ['Recommended']
77915 3287 \r\n "It is not the fault of th... ['Recommended']
77916 3287 \r\n "Tenkuu Danzai Skelter+Hea... ['Recommended']

77917 rows × 3 columns

The above dataframe shows the scraped review data with the corresponding title ids, opening up the possibility of using it for a text-based collaborative recommendation system or for sentiment analysis in general.
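
Before any such analysis, the review text would likely need some light cleanup, for example collapsing the carriage returns and newlines visible above:

# Collapse runs of whitespace (\r\n padding) in the raw review text
df2['Review'] = df2['Review'].str.replace(r'\s+', ' ', regex=True).str.strip()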

4. Conclusion

In this notebook we have gone through the process of scraping a sizeable dataset for future use. In the next part we will scrape user data from the same site using a different strategy.