Contents

  1. Overview
  2. Data Preprocessing
  3. Data Exploration
  4. Modelling
  5. Conclusion

1. Overview

In this notebook we will be exploring the IMDB dataset available on Kaggle, which contains 50,000 movie reviews labelled as either positive or negative. A text classification model will then be fine-tuned on top of DistilBERT and evaluated.


import pandas as pd
import numpy as np
import random
from collections import Counter
from datasets import load_dataset, Dataset

import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, preprocessing
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, roc_auc_score, roc_curve

from transformers import AutoTokenizer, DistilBertTokenizerFast
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import torch
import evaluate
from bs4 import BeautifulSoup
import lxml

2. Data Preprocessing

We will begin by initialising certain variables that will be used over the course of this notebook, before importing the dataset using pandas.

data_path = 'IMDB Dataset.csv' 
text_column_name = "review" 
label_column_name = "sentiment" 

model_name = "distilbert-base-uncased" 
test_size = 0.2 
num_labels = 2 

df = pd.read_csv(data_path)
df.head()
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
df.isnull().sum()
review       0
sentiment    0
dtype: int64
df.sentiment.value_counts()
sentiment
positive    25000
negative    25000
Name: count, dtype: int64
df['review'][1]
'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'

We see that this is a balanced dataset with a 1:1 ratio of positive and negative reviews, so there is no need to handle class imbalance.

Looking at the review, we see that HTML tags such as <br /> exist in the dataset; we will remove them as part of data cleaning, using the BeautifulSoup package to parse our reviews. Additionally, we will encode our “sentiment” column as an integer label.

class Cleaner():
  def __init__(self):
    pass
  def put_line_breaks(self,text):
    # preserve paragraph breaks before the tags are stripped
    text = text.replace('</p>','</p>\n')
    return text
  def remove_html_tags(self,text):
    # parse with lxml and keep only the visible text
    cleantext = BeautifulSoup(text, "lxml").text
    return cleantext
  def clean(self,text):
    text = self.put_line_breaks(text)
    text = self.remove_html_tags(text)
    return text
cleaner = Cleaner()
df['text_cleaned'] = df[text_column_name].apply(cleaner.clean)
df.head()
review sentiment text_cleaned
0 One of the other reviewers has mentioned that ... positive One of the other reviewers has mentioned that ...
1 A wonderful little production. <br /><br />The... positive A wonderful little production. The filming tec...
2 I thought this was a wonderful way to spend ti... positive I thought this was a wonderful way to spend ti...
3 Basically there's a family where a little boy ... negative Basically there's a family where a little boy ...
4 Petter Mattei's "Love in the Time of Money" is... positive Petter Mattei's "Love in the Time of Money" is...
le = preprocessing.LabelEncoder()
le.fit(df[label_column_name].tolist())
df['label'] = le.transform(df[label_column_name].tolist())
df.head()
review sentiment text_cleaned label
0 One of the other reviewers has mentioned that ... positive One of the other reviewers has mentioned that ... 1
1 A wonderful little production. <br /><br />The... positive A wonderful little production. The filming tec... 1
2 I thought this was a wonderful way to spend ti... positive I thought this was a wonderful way to spend ti... 1
3 Basically there's a family where a little boy ... negative Basically there's a family where a little boy ... 0
4 Petter Mattei's "Love in the Time of Money" is... positive Petter Mattei's "Love in the Time of Money" is... 1
df['wordcount'] = df['text_cleaned'].str.split().str.len()
df.head()
review sentiment text_cleaned label wordcount
0 One of the other reviewers has mentioned that ... positive One of the other reviewers has mentioned that ... 1 301
1 A wonderful little production. <br /><br />The... positive A wonderful little production. The filming tec... 1 156
2 I thought this was a wonderful way to spend ti... positive I thought this was a wonderful way to spend ti... 1 162
3 Basically there's a family where a little boy ... negative Basically there's a family where a little boy ... 0 132
4 Petter Mattei's "Love in the Time of Money" is... positive Petter Mattei's "Love in the Time of Money" is... 1 222
df.wordcount.describe()
count    50000.000000
mean       227.114620
std        168.278914
min          4.000000
25%        124.000000
50%        170.000000
75%        275.000000
max       2450.000000
Name: wordcount, dtype: float64

To help with the upcoming data exploration, we have engineered a new feature holding the word count of each review; a quick look shows that 75% of reviews contain fewer than 275 words.
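As a quick, hedged sanity check (not part of the original analysis): DistilBERT, which we fine-tune later with truncation enabled, only sees the first 512 tokens of each review, so it is worth estimating how many reviews are longer than that. Word count is only a rough proxy for token count.

# Rough check of how many reviews exceed DistilBERT's 512-token limit.
# Word count is only a proxy for token count, so treat this as an estimate.
print('Share of reviews longer than 512 words:', (df['wordcount'] > 512).mean())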

3. Data Exploration

Next, we will take a closer look at the distributions of word counts to investigate whether they show any differences between positive and negative reviews.

fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True)
sns.histplot(data=df.loc[df['sentiment']=='positive'], x='wordcount', ax=ax1, binwidth=50)
sns.histplot(data=df.loc[df['sentiment']=='negative'], x='wordcount', ax=ax2, binwidth=50)
fig.suptitle('Review Wordcount Distribution')
ax1.set_title('Positive reviews')
ax2.set_title('Negative reviews')
fig.tight_layout()

[Figure: Review Wordcount Distribution, histograms of word counts for positive and negative reviews]

fig, (ax1, ax2) = plt.subplots(1,2, sharey=True)
sns.boxplot(data=df.loc[df['sentiment']=='positive']['wordcount'].values, ax=ax1)
sns.boxplot(data=df.loc[df['sentiment']=='negative']['wordcount'].values, ax=ax2)
fig.suptitle('Review Wordcount Boxplot')
ax1.set_title('Positive reviews')
ax2.set_title('Negative reviews')
fig.tight_layout()

[Figure: Review Wordcount Boxplot, box plots of word counts for positive and negative reviews]

Looking at the above histograms and box plots of word counts for positive and negative reviews, we see that the two distributions are very similar, apart from a few high-word-count outliers among the positive reviews.

Next we will look at some of the most common words and phrases within the dataset.

def most_common_words(df, n):
    corpus = []
    stop = set(stopwords.words('english'))
    for review in df.text_cleaned:
        for word in review.split():
            if word.strip().lower() not in stop and word.strip().lower().isalpha():
                corpus.append(word.strip())
    
    counter_words = Counter(corpus).most_common(n)
    counter_words = dict(counter_words)
    tmp = pd.DataFrame(columns = ["Word", 'Count'])
    tmp["Word"] = list(counter_words.keys())
    tmp['Count'] = list(counter_words.values())
    return tmp

# def most_common_ngrams(corpus, n, gram):
#     vec = CountVectorizer(ngram_range = (gram, gram)).fit(corpus)
#     bow = vec.transform(corpus)
#     word_sum = bow.sum(axis=0)
#     word_freq = [(word, word_sum[0, idx]) for word, idx in vec.vocabulary_.items()]
#     word_freq = sorted(word_freq, key = lambda x: x[1], reverse = True)
#     return word_freq[:n]

def most_common_ngrams(df, n, gram, name):
    corpus = []
    stop = set(stopwords.words('english'))
    for review in df.text_cleaned:
        words = review.split()
        words = [word for word in words if word not in stop]
        for i in range(len(words)-gram+1):
            corpus.append(' '.join(words[i:i+gram]))
    counter_ngrams = Counter(corpus).most_common(n)
    ngrams = dict(counter_ngrams)
    tmp = pd.DataFrame(columns = [str(name), 'Count'])
    tmp[str(name)] = list(ngrams.keys())
    tmp['Count'] = list(ngrams.values())
    return tmp
positive_corpus = most_common_words(df.loc[df['sentiment']=='positive'], 10)
negative_corpus = most_common_words(df.loc[df['sentiment']=='negative'], 10)

positive_bigram = most_common_ngrams(df.loc[df['sentiment']=='positive'], 10, 2, 'Bigram')
negative_bigram = most_common_ngrams(df.loc[df['sentiment']=='negative'], 10, 2, 'Bigram')
positive_trigram = most_common_ngrams(df.loc[df['sentiment']=='positive'], 10, 3, 'Trigram')
negative_trigram = most_common_ngrams(df.loc[df['sentiment']=='negative'], 10, 3, 'Trigram')
fig, (ax1, ax2) = plt.subplots(1,2)
sns.barplot(data=positive_corpus, x='Count', y='Word', ax=ax1, palette = 'Paired')
sns.barplot(data=negative_corpus, x='Count', y='Word', ax=ax2, palette = 'Paired')
fig.suptitle('10 most common words in reviews')
ax1.set_title('Positive Review')
ax2.set_title('Negative Review')
plt.tight_layout()

[Figure: 10 most common words in positive and negative reviews]

fig, (ax1, ax2) = plt.subplots(1,2)
sns.barplot(data=positive_bigram, x='Count', y='Bigram', ax=ax1, palette = 'Paired')
sns.barplot(data=negative_bigram, x='Count', y='Bigram', ax=ax2, palette = 'Paired')
fig.suptitle('10 most common bigrams in reviews')
ax1.set_title('Positive Review')
ax2.set_title('Negative Review')
plt.tight_layout()

[Figure: 10 most common bigrams in positive and negative reviews]

fig, (ax1, ax2) = plt.subplots(1,2)
sns.barplot(data=positive_trigram, x='Count', y='Trigram', ax=ax1, palette = 'Paired')
sns.barplot(data=negative_trigram, x='Count', y='Trigram', ax=ax2, palette = 'Paired')
fig.suptitle('10 most common trigrams in reviews')
ax1.set_title('Positive Review')
ax2.set_title('Negative Review')
plt.tight_layout()

[Figure: 10 most common trigrams in positive and negative reviews]

We can see phrases such as “I highly recommend” and “I would recommend” appearing in positive reviews, carrying a positive connotation, while phrases like “one worst movies” and “worst movie I” appear in negative reviews, carrying a negative connotation. However, many of the most common words and phrases appear in both classes, suggesting that which exact phrases occur is more informative than raw frequency alone.

For the purpose of data exploration, the above plots were created after removing stopwords from the reviews. In the next section, where we fine-tune a BERT model for text classification, stopwords will be left in the dataset to provide context clues for our model.
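Before moving on to BERT, a simple bag-of-words model gives us a point of comparison. The sketch below is not part of the original analysis: it reuses the TfidfVectorizer, LogisticRegression, train_test_split and accuracy_score imports from the start of the notebook on the cleaned text, and its hyperparameters (such as max_features) are purely illustrative.

# Hedged sketch of a TF-IDF + logistic regression baseline on the cleaned reviews.
# Column names follow the dataframe built above; hyperparameters are illustrative.
X_train, X_val, y_train, y_val = train_test_split(
    df['text_cleaned'], df['label'], test_size=test_size, random_state=42
)
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2), stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train_vec, y_train)
print('Baseline accuracy:', accuracy_score(y_val, baseline.predict(X_val_vec)))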


4. Modelling (BERT)

In this section we will fine-tune a DistilBERT model for text classification. Due to limited computational power, we sample 10% of the original dataset (5,000 reviews) for training and validation, and another 5% (2,500 reviews) as a holdout set for model evaluation after training is complete.

# Sample 15% of the dataset (7,500 rows) keeping the 1:1 ratio of sentiments, then carve out a shuffled, balanced 2,500-row holdout set
data = df.groupby('sentiment').apply(lambda x: x.sample(frac=0.15)).droplevel('sentiment')
holdout = pd.concat([data.iloc[:1250],data.iloc[6250:]]).sample(frac=1)
holdout = holdout.rename(columns={'label': 'true_label'})
holdout
review sentiment text_cleaned true_label wordcount
16369 Oh, this is such a glorious musical. There's a... positive Oh, this is such a glorious musical. There's a... 1 121
11852 As an old white housewife I can still apprecia... positive As an old white housewife I can still apprecia... 1 114
49783 Saw this movie at the Rotterdam IFF. You may q... positive Saw this movie at the Rotterdam IFF. You may q... 1 54
21792 The minutiae of what's involved in carrying ou... positive The minutiae of what's involved in carrying ou... 1 152
21757 Greetings again from the darkness. What a reli... positive Greetings again from the darkness. What a reli... 1 271
... ... ... ... ... ...
17962 I'm glad I rented this movie for one reason: i... negative I'm glad I rented this movie for one reason: i... 0 148
34361 Overall, a well done movie. There were the par... positive Overall, a well done movie. There were the par... 1 349
3722 Doesn't anyone bother to check where this kind... negative Doesn't anyone bother to check where this kind... 0 200
38341 Such a long awaited movie.. But it has disappo... negative Such a long awaited movie.. But it has disappo... 0 206
49517 This movie is a terrible attempt at a spoof. I... negative This movie is a terrible attempt at a spoof. I... 0 144

2500 rows × 5 columns

data = data.iloc[1250:6250].sample(frac=1)
data
review sentiment text_cleaned label wordcount
29963 This is Peter Falk's film. Period.<br /><br />... negative This is Peter Falk's film. Period.I was 10 yea... 0 246
35201 I don't see that much wrong with this movie. G... positive I don't see that much wrong with this movie. G... 1 218
25325 This movie wasn't that bad when compared to th... negative This movie wasn't that bad when compared to th... 0 151
44489 There is no greater disservice to do to histor... positive There is no greater disservice to do to histor... 1 538
7754 Posh Spice Victoria Beckham and her alleged ne... negative Posh Spice Victoria Beckham and her alleged ne... 0 657
... ... ... ... ... ...
44529 OK from the point of view of an American, who ... negative OK from the point of view of an American, who ... 0 189
11739 I love bad movies: Showgirls, Plan 9 from Oute... negative I love bad movies: Showgirls, Plan 9 from Oute... 0 122
3805 It was probably just my DVD---but I would not ... negative It was probably just my DVD---but I would not ... 0 185
5756 For some unknown reason, 7 years ago, I watche... negative For some unknown reason, 7 years ago, I watche... 0 127
37916 whomever thought of having sequels to Iron Eag... negative whomever thought of having sequels to Iron Eag... 0 219

5000 rows × 5 columns

# train / validation dataset splits
df_train, df_test = train_test_split(data[['text_cleaned','label']], test_size=test_size)
# convert our pandas dataframes to Hugging Face Datasets
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)
# tokenizer to convert our words to tokens
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)

def preprocess_function(examples):
    return tokenizer(examples["text_cleaned"], truncation=True)

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy = "epoch",
    logging_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
    
)
trainer.train()
[1250/1250 08:39, Epoch 5/5]

Epoch  Training Loss  Validation Loss  Accuracy
1      0.359300       0.253881         0.901000
2      0.176800       0.296619         0.906000
3      0.095000       0.434078         0.895000
4      0.041500       0.437259         0.906000
5      0.019900       0.460906         0.904000

TrainOutput(global_step=1250, training_loss=0.13849494819641114, metrics={'train_runtime': 520.6204, 'train_samples_per_second': 38.416, 'train_steps_per_second': 2.401, 'total_flos': 2603187224294592.0, 'train_loss': 0.13849494819641114, 'epoch': 5.0})

It appears that validation loss increased after the first epoch while training loss continued to decrease, suggesting some overfitting. However, looking at validation accuracy we see that the model's performance remained roughly constant, at around 90%, through all 5 epochs.
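If we were to re-run training, one hedged way to handle this overfitting (a sketch only, not what was done above) would be to checkpoint every epoch, reload the checkpoint with the lowest validation loss at the end, and stop early once validation loss stops improving, using transformers' EarlyStoppingCallback. The settings below simply mirror the TrainingArguments used earlier.

# Sketch: keep the best checkpoint by validation loss and stop early.
# Not part of the original run; settings mirror the TrainingArguments above.
from transformers import EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",                # checkpoint each epoch so the best one can be restored
    logging_strategy="epoch",
    load_best_model_at_end=True,          # reload the checkpoint with the lowest eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)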

We will now check whether this performance holds on our holdout set, extracting the model's predictions (label and score) from a text-classification pipeline.

from transformers import pipeline
classifier = pipeline(
    task='text-classification',
    model=model,
    tokenizer=tokenizer,
    device=0,
    truncation=True,
    batch_size=8
)
holdout['result'] = holdout['text_cleaned'].apply(lambda x: classifier(x))
holdout['sentiment'] = holdout['result'].str[0].str['label']
holdout['score'] = holdout['result'].str[0].str['score']
holdout
review sentiment text_cleaned true_label wordcount result score
16369 Oh, this is such a glorious musical. There's a... LABEL_1 Oh, this is such a glorious musical. There's a... 1 121 [{'label': 'LABEL_1', 'score': 0.9983893632888... 0.998389
11852 As an old white housewife I can still apprecia... LABEL_1 As an old white housewife I can still apprecia... 1 114 [{'label': 'LABEL_1', 'score': 0.9985773563385... 0.998577
49783 Saw this movie at the Rotterdam IFF. You may q... LABEL_0 Saw this movie at the Rotterdam IFF. You may q... 1 54 [{'label': 'LABEL_0', 'score': 0.7969191670417... 0.796919
21792 The minutiae of what's involved in carrying ou... LABEL_1 The minutiae of what's involved in carrying ou... 1 152 [{'label': 'LABEL_1', 'score': 0.9989652633666... 0.998965
21757 Greetings again from the darkness. What a reli... LABEL_1 Greetings again from the darkness. What a reli... 1 271 [{'label': 'LABEL_1', 'score': 0.9985762834548... 0.998576
... ... ... ... ... ... ... ...
17962 I'm glad I rented this movie for one reason: i... LABEL_0 I'm glad I rented this movie for one reason: i... 0 148 [{'label': 'LABEL_0', 'score': 0.9990332126617... 0.999033
34361 Overall, a well done movie. There were the par... LABEL_0 Overall, a well done movie. There were the par... 1 349 [{'label': 'LABEL_0', 'score': 0.9182419180870... 0.918242
3722 Doesn't anyone bother to check where this kind... LABEL_0 Doesn't anyone bother to check where this kind... 0 200 [{'label': 'LABEL_0', 'score': 0.9989865422248... 0.998987
38341 Such a long awaited movie.. But it has disappo... LABEL_0 Such a long awaited movie.. But it has disappo... 0 206 [{'label': 'LABEL_0', 'score': 0.9987562894821... 0.998756
49517 This movie is a terrible attempt at a spoof. I... LABEL_0 This movie is a terrible attempt at a spoof. I... 0 144 [{'label': 'LABEL_0', 'score': 0.9990750551223... 0.999075

2500 rows × 7 columns

holdout['sentiment']= holdout['sentiment'].map({'LABEL_1':1, 'LABEL_0':0})
holdout
review sentiment text_cleaned true_label wordcount result score
16369 Oh, this is such a glorious musical. There's a... 1 Oh, this is such a glorious musical. There's a... 1 121 [{'label': 'LABEL_1', 'score': 0.9983893632888... 0.998389
11852 As an old white housewife I can still apprecia... 1 As an old white housewife I can still apprecia... 1 114 [{'label': 'LABEL_1', 'score': 0.9985773563385... 0.998577
49783 Saw this movie at the Rotterdam IFF. You may q... 0 Saw this movie at the Rotterdam IFF. You may q... 1 54 [{'label': 'LABEL_0', 'score': 0.7969191670417... 0.796919
21792 The minutiae of what's involved in carrying ou... 1 The minutiae of what's involved in carrying ou... 1 152 [{'label': 'LABEL_1', 'score': 0.9989652633666... 0.998965
21757 Greetings again from the darkness. What a reli... 1 Greetings again from the darkness. What a reli... 1 271 [{'label': 'LABEL_1', 'score': 0.9985762834548... 0.998576
... ... ... ... ... ... ... ...
17962 I'm glad I rented this movie for one reason: i... 0 I'm glad I rented this movie for one reason: i... 0 148 [{'label': 'LABEL_0', 'score': 0.9990332126617... 0.999033
34361 Overall, a well done movie. There were the par... 0 Overall, a well done movie. There were the par... 1 349 [{'label': 'LABEL_0', 'score': 0.9182419180870... 0.918242
3722 Doesn't anyone bother to check where this kind... 0 Doesn't anyone bother to check where this kind... 0 200 [{'label': 'LABEL_0', 'score': 0.9989865422248... 0.998987
38341 Such a long awaited movie.. But it has disappo... 0 Such a long awaited movie.. But it has disappo... 0 206 [{'label': 'LABEL_0', 'score': 0.9987562894821... 0.998756
49517 This movie is a terrible attempt at a spoof. I... 0 This movie is a terrible attempt at a spoof. I... 0 144 [{'label': 'LABEL_0', 'score': 0.9990750551223... 0.999075

2500 rows × 7 columns

print('Accuracy: ', accuracy_score(holdout['true_label'], holdout['sentiment']))
print('F1: ', f1_score(holdout['true_label'], holdout['sentiment']))
print('Confusion Matrix: ', confusion_matrix(holdout['true_label'], holdout['sentiment']))
Accuracy:  0.9044
F1:  0.9051210797935689
Confusion Matrix:  [[1121  129]
 [ 110 1140]]

Predicting on our holdout set, we see that the model achieves approximately 90.4% accuracy and an F1 score of 0.905. This is consistent with our validation accuracy during training, suggesting that the model will generalise well to similar, unseen reviews.
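Since roc_auc_score was imported at the start of the notebook, we could also sketch a ranking metric. The pipeline's score column holds the confidence of the predicted label, not the probability of the positive class, so it needs to be flipped for negative predictions first. This is a hedged sketch only, assuming the holdout dataframe built above.

# Convert the confidence of the predicted label into an estimated probability
# of the positive class, then compute ROC AUC on the holdout set. Sketch only.
prob_positive = np.where(holdout['sentiment'] == 1,
                         holdout['score'],
                         1 - holdout['score'])
print('ROC AUC: ', roc_auc_score(holdout['true_label'], prob_positive))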


5. Conclusion

In this notebook we have explored the IMDB Movie Reviews dataset using techniques commonly found in NLP and sentiment analysis. We have also fine-tuned a text classification model on top of DistilBERT, achieving a good performance of ~90% accuracy on an unseen holdout set.