Sentiment Analysis with Keras, TensorFlow 2.0, and TensorFlow Hub using IMDB data in Google Colab.

by CM


Posted on October 26, 2019



The Goal:

Information and communications technologies, including new ML models, have fostered the rise of natural language processing, enabling models to achieve better prediction accuracy across many languages. In this article, we focus on a self-contained piece of a TensorFlow graph, along with its weights and assets, that we reuse for Natural Language Processing -- a process known as transfer learning. Transfer learning is a research concept in ML in which knowledge gained while solving one problem is applied to a different but related problem. The TensorFlow Hub module used in this article is among the most prominent examples of transfer learning in NLP. We leverage the Keras API, which allows an easy integration of TensorFlow Hub modules. The benefits of transfer learning may differ between tasks; however, given the results of this article, we find that transfer learning is highly valuable for Sentiment Analysis.

We now start building a model that allows us to predict the sentiment of English text. The output should be either positive or negative. We leverage a pretrained TensorFlow Hub module, which should provide better accuracy for our predictions.


Key components are:

Dataset:
>> IMDB dataset containing 50K movie reviews

First, we upgrade to TensorFlow 2.0 via pip (pip is the package installer for Python). Depending on the versions already installed, some requirements might already be satisfied.

### Upgrade to TensorFlow 2.0
!pip install --upgrade tensorflow
==========================
EXAMPLE OUTPUT
==========================

Collecting tensorflow
  Downloading https://files.pythonhosted.org/packages/46/0f/7bd55361168bb32796b360ad15a25de6966c9c1beb58a8e30c01c8279862/tensorflow-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (86.3MB)
     |████████████████████████████████| 86.3MB 114kB/s
Collecting tensorboard<2.1.0,>=2.0.0
  Downloading https://files.pythonhosted.org/packages/9b/a6/e8ffa4e2ddb216449d34cfcb825ebb38206bee5c4553d69e7bc8bc2c5d64/tensorboard-2.0.0-py3-none-any.whl (3.8MB)
     |████████████████████████████████| 3.8MB 41.4MB/s
Collecting tensorflow-estimator<2.1.0,>=2.0.0
  Downloading https://files.pythonhosted.org/packages/fc/08/8b927337b7019c374719145d1dceba21a8bb909b93b1ad6f8fb7d22c1ca1/tensorflow_estimator-2.0.1-py2.py3-none-any.whl (449kB)
     |████████████████████████████████| 450kB 46.7MB/s

Second, we import all dependencies (note that these libraries come preinstalled with Colab; in case you are using e.g. a Jupyter notebook on your local machine, make sure to install the respective libraries, e.g. via pip).

### Importing all dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

import keras
import tensorflow as tf
import tensorflow_hub as hub

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

from sklearn.model_selection import train_test_split

We then need to upload our IMDB dataset to Colab. To upload data to Colab there are several options. In the following, I will introduce two popular ones:
Option 1: Upload from local drive.
Option 2: Upload from Google Drive.

Option 1: We can easily upload our IMDB dataset from our local storage. We use the pd.read_csv function to transform the uploaded CSV file into a Pandas dataframe.

from google.colab import files
uploaded = files.upload()
import io

# Read your uploaded CSV file into a dataframe
movie_reviews = pd.read_csv(io.BytesIO(uploaded['MAKE SURE TO PUT YOUR CSV FILE NAME HERE.csv']))

Option 2: We can read the CSV file directly from Google Drive. This option is often faster than the first, although it requires authenticating with your Google Drive account.

#Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

------------------------------------------------------------------------------------------
Enter your authorization code: [Your Code]
··········
Mounted at /content/gdrive
------------------------------------------------------------------------------------------

We then store the CSV file in a pandas dataframe for further use. Make sure to input your file path and the "file name + .csv" correctly.

movie_reviews = pd.read_csv('[Your Path] IMDB Dataset.csv')

After we have uploaded our CSV file, we review it in detail. First, we check whether it has been successfully loaded into a Pandas dataframe. Then we determine whether ANY value in the CSV file is missing. We print the shape of the dataframe and its head (the first five rows) to make a quick visual check.

print(type(movie_reviews))
movie_reviews.isnull().values.any()

print(movie_reviews.shape)
print(movie_reviews.head())

We find that movie_reviews is of type Pandas DataFrame and has a shape of 50000 rows and two columns (review, sentiment). Moreover, we find that the reviews contain HTML tags as well as single characters that are not helpful in identifying whether a review has a positive or negative sentiment.

==========================
OUTPUT
==========================

<class 'pandas.core.frame.DataFrame'>
(50000, 2)
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

In addition, it is highly recommended that the data is not heavily skewed towards positive or negative reviews. Hence, we plot the sentiment column of the dataframe using the seaborn countplot function to have a look at the distribution of both sentiments.

sns.countplot(x='sentiment',
              edgecolor=(0,0,0),
              linewidth=2,
              palette="Dark2",
              data=movie_reviews)

We find that we have 25000 positive and 25000 negative reviews in our dataset. Hence, our data is evenly balanced and not skewed.

==========================
OUTPUT
==========================

[Countplot: 25000 positive vs. 25000 negative reviews]

We now preprocess the reviews in our dataframe. In other words, we make sure that the dataframe only contains the relevant words that might indicate a sentiment, removing all other characters that do not contribute to explaining the sentiment.

In particular, we build a function to remove those characters:

  • (1) In the first step, we remove all HTML tags
  • (2) In the second step, we remove punctuation and numbers
  • (3) In the third step, we remove standalone single characters
  • (4) In the fourth step, we collapse multiple whitespaces into a single space

def preprocess_text(sen):
    # Removing HTML tags
    sentence = re.sub(r'<[^>]+>', '', sen)

    # Removing punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Removing standalone single characters
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
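
To see what this function does, here is a quick sanity check on a made-up snippet (the sample string below is purely illustrative and not taken from the dataset):

# Hypothetical example string to illustrate the cleaning steps
sample = "A wonderful little production. <br /><br />It was a 10/10!"
print(preprocess_text(sample))
# Tags, punctuation, digits, and stray single characters are stripped,
# leaving only space-separated words.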

We now apply the just-defined preprocess function to our review column. To this end, we create an empty list and append each review to it after it has been processed.

X = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    X.append(preprocess_text(sen))

We then take the 'sentiment' column of our dataframe and replace the strings with integers via a lambda function (an anonymous function). In detail, we replace the 'positive' string with a 1 and the 'negative' string with a 0. As a recap: in Python, an anonymous function is a function without a name, in contrast to normal functions (def), which have one.

y = movie_reviews['sentiment']

y = np.array(list(map(lambda x: 0 if x=="negative" else 1, y)))
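
For illustration, the lambda above is equivalent to the following named function (the name to_label is just a hypothetical choice for this sketch):

# Named equivalent of the lambda used above
def to_label(x):
    return 0 if x == "negative" else 1

y_alt = np.array([to_label(x) for x in movie_reviews['sentiment']])  # same result as y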

We can check whether the transformation worked via two simple print statements.

print(movie_reviews.sentiment.head())
print(y[:5])

Your output should look like this:

==========================
OUTPUT
==========================
0    positive
1    positive
2    positive
3    negative
4    positive

[1 1 1 0 1]

Perfect -- the array now resembles the sentiment column (1 = positive, 0 = negative). Now we start with the computational machine learning part. First, we split the data into training data (80%) and test data (20%), passing a random number as the seed. train_test_split returns lists.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=88)

We then transform the newly generated lists into NumPy arrays.

X_train = np.asarray(X_train)
X_test = np.asarray(X_test)

print(type(X_train))
print(type(y_train))
print(type(X_test))
print(type(y_test))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>

We now make use of a TensorFlow Hub module. The Hub module transforms our NumPy array of strings into a tensor: it takes a batch of sentences in a 1-D tensor of strings as input and embeds each sentence into a 20-dimensional vector. The token-based text embedding was trained on the English Google News 130GB corpus. With it, we define a new custom layer, a so-called KerasLayer, that we will use in our model.

# 20-dimensional pretrained text embedding from TensorFlow Hub
hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
                           output_shape=[20],
                           input_shape=[],
                           dtype=tf.string,
                           trainable=True)
hub_layer(X_train[[0]])

==========================
OUTPUT
==========================

<tf.Tensor: id=689, shape=(1, 20), dtype=float32, numpy=
array([[ 4.0993705 , -5.788245  ,  5.5879493 , -1.0455519 , -6.2843003 ,
        -5.5901012 , -2.132286  ,  2.0877192 ,  4.2431483 , -1.4409944 ,
        -2.2836738 ,  1.532598  , -1.9618702 ,  0.29757962, -7.0124855 ,
         3.0051506 ,  6.3973274 , -3.191991  , -5.040606  , -2.0053916 ]],
      dtype=float32)>

Finally, we build our model. We use a Sequential model with only three layers: our custom hub layer followed by two dense layers.

# Model
model = Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(200, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()

==========================
OUTPUT
==========================

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
keras_layer (KerasLayer)     (None, 20)                400020
_________________________________________________________________
dense (Dense)                (None, 200)               4200
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 201
=================================================================
Total params: 404,421
Trainable params: 404,421
Non-trainable params: 0
_________________________________________________________________

Before training a model, we need to configure the learning process, which is done via the compile method. It receives three arguments: an optimizer, a loss function, and a list of metrics.

Before we compile our model, we double the learning rate of the Adam optimizer from its default of 0.001 to 0.002. We now compile our model:

adam_optimizer = tf.keras.optimizers.Adam(learning_rate=0.002, beta_1=0.9, beta_2=0.999, amsgrad=False)

model.compile(optimizer=adam_optimizer, loss='binary_crossentropy', metrics=['acc', 'mae', 'mse'])

After compiling, we fit (train) our model.

history = model.fit(X_train, y_train, batch_size=750, epochs=8, verbose=1, validation_split=0.20)

==========================
OUTPUT
==========================

Train on 32000 samples, validate on 8000 samples
Epoch 1/8
32000/32000 [==============================] - 4s 140us/sample - loss: 0.6236 - acc: 0.6498 - mae: 0.4366 - mse: 0.2172 - val_loss: 0.5678 - val_acc: 0.7130 - val_mae: 0.4044 - val_mse: 0.1926
Epoch 2/8
32000/32000 [==============================] - 4s 115us/sample - loss: 0.5380 - acc: 0.7344 - mae: 0.3803 - mse: 0.1801 - val_loss: 0.4973 - val_acc: 0.7670 - val_mae: 0.3512 - val_mse: 0.1635
Epoch 3/8
32000/32000 [==============================] - 4s 116us/sample - loss: 0.4583 - acc: 0.7908 - mae: 0.3210 - mse: 0.1482 - val_loss: 0.4188 - val_acc: 0.8149 - val_mae: 0.2920 - val_mse: 0.1334
Epoch 4/8
32000/32000 [==============================] - 4s 117us/sample - loss: 0.3805 - acc: 0.8366 - mae: 0.2631 - mse: 0.1193 - val_loss: 0.3590 - val_acc: 0.8449 - val_mae: 0.2433 - val_mse: 0.1123
Epoch 5/8
32000/32000 [==============================] - 4s 117us/sample - loss: 0.3257 - acc: 0.8639 - mae: 0.2206 - mse: 0.1000 - val_loss: 0.3234 - val_acc: 0.8624 - val_mae: 0.2134 - val_mse: 0.1001
Epoch 6/8
32000/32000 [==============================] - 4s 117us/sample - loss: 0.2820 - acc: 0.8837 - mae: 0.1901 - mse: 0.0851 - val_loss: 0.3003 - val_acc: 0.8721 - val_mae: 0.1937 - val_mse: 0.0923
Epoch 7/8
32000/32000 [==============================] - 4s 116us/sample - loss: 0.2512 - acc: 0.8988 - mae: 0.1686 - mse: 0.0749 - val_loss: 0.2907 - val_acc: 0.8785 - val_mae: 0.1809 - val_mse: 0.0891
Epoch 8/8
32000/32000 [==============================] - 4s 118us/sample - loss: 0.2288 - acc: 0.9090 - mae: 0.1519 - mse: 0.0675 - val_loss: 0.2806 - val_acc: 0.8819 - val_mae: 0.1707 - val_mse: 0.0857

We find that the accuracy on our training data reaches 0.9090 and the validation accuracy 0.8819 -- pretty good already. But how does the model perform on our test data?

score = model.evaluate(X_test, y_test, verbose=1)
print("Test Score:", score[0])
print("Test Accuracy:", score[1])
print("Test Mean absolute error:", score[2])
print("Test Mean squared error:", score[3])

==========================
OUTPUT
==========================

Test Score: 0.2876012816429138
Test Accuracy: 0.8795
Test Mean absolute error: 0.17210841
Test Mean squared error: 0.087166086

We find a satisfying test accuracy of 0.88 with an MAE of 0.17, an MSE of 0.09, and a loss of 0.29. We plot training vs. validation metrics across the epochs to identify potential overfitting or underfitting.

#Model Accuracy
ax = plt.gca()
ax.set_facecolor('tab:grey')

plt.plot(history.history['acc'], color  = 'k')
plt.plot(history.history['val_acc'], linestyle ='--', color='navy')

plt.title('Model Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','val'], loc = 'center')

plt.grid(color='black', linestyle='-', linewidth=1, alpha=0.3)
plt.show()

#Model Loss
ax = plt.gca()
ax.set_facecolor('tab:grey')

plt.plot(history.history['loss'], color  = 'k')
plt.plot(history.history['val_loss'], linestyle ='--', color='navy')

plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','val'], loc = 'center')

plt.grid(color='black', linestyle='-', linewidth=1, alpha=0.3)

plt.show()

#Model MAE
ax = plt.gca()
ax.set_facecolor('tab:grey')

plt.plot(history.history['mae'], color  = 'k')
plt.plot(history.history['val_mae'], linestyle ='--', color='navy')

plt.title('Model MAE')
plt.ylabel('mae')
plt.xlabel('epoch')
plt.legend(['train','val'], loc = 'center')

plt.grid(color='black', linestyle='-', linewidth=1, alpha=0.3)

plt.show()


#Model MSE
ax = plt.gca()
ax.set_facecolor('tab:grey')

plt.plot(history.history['mse'], color  = 'k')
plt.plot(history.history['val_mse'], linestyle ='--', color='navy')

plt.title('Model MSE')
plt.ylabel('mse')
plt.xlabel('epoch')
plt.legend(['train','val'], loc = 'center')

plt.grid(color='black', linestyle='-', linewidth=1, alpha=0.3)

plt.show()
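
Since the four plotting blocks above repeat the same pattern, you can optionally wrap them in a small helper. This is just a refactoring sketch, assuming the metric names ('acc', 'loss', 'mae', 'mse') recorded in history above:

# Optional helper: plot a training metric against its validation counterpart
def plot_metric(history, metric):
    ax = plt.gca()
    ax.set_facecolor('tab:grey')
    plt.plot(history.history[metric], color='k')
    plt.plot(history.history['val_' + metric], linestyle='--', color='navy')
    plt.title('Model ' + metric.upper())
    plt.ylabel(metric)
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='center')
    plt.grid(color='black', linestyle='-', linewidth=1, alpha=0.3)
    plt.show()

for m in ['acc', 'loss', 'mae', 'mse']:
    plot_metric(history, m)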

All four plots show similar trends for training and validation with respect to accuracy, MAE, MSE, and loss. This is how it should look!

We now test our model and let it predict on an obviously positive review. In this regard, we print the prediction as well as the review.

my_array = [["I fits perfectly. These sandals are beautiful and very well made. I am looking forward to warmer weather so I can wear these sandals!"]]
print(my_array[0])
prediction  = model.predict(my_array[0])
print(prediction)

Our model predicts that the review is positive (with a probability of about 94%).

==========================
OUTPUT
==========================

['I fits perfectly. These sandals are beautiful and very well made. I am looking forward to warmer weather so I can wear these sandals!']
[[0.9399543]]
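
As a counter-check, we can feed the model an obviously negative sentence in the same way. The example text below is made up for illustration; the predicted probability should come out close to 0 (negative), though the exact value depends on your trained weights:

# Hypothetical negative review for a quick sanity check
negative_review = ["This was a complete waste of time. The plot was boring and the acting was terrible."]
print(model.predict(negative_review))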

We have done it. We have built a powerful ML model that allows us to predict the sentiment of a text with high accuracy.

Leverage TensorFlow Hub and Keras!

#EpicML

