by CM
Posted on October 26, 2019
Advances in machine learning (ML) have driven rapid progress in natural language processing (NLP), improving prediction accuracy across many languages. In this article, we focus on a TensorFlow Hub module: a self-contained piece of a TensorFlow graph, along with its weights and assets, that we reuse for NLP in a process known as transfer learning. Transfer learning is a research concept in ML that takes knowledge gained while solving one problem and applies it to a different but related problem. The TensorFlow Hub module in this article is among the most prominent examples of transfer learning in NLP, and we use the Keras API to integrate it easily. The benefit of transfer learning may differ from task to task; given the results of this article, however, we find it highly valuable for sentiment analysis.
We now build a model that predicts the sentiment of English text. The output should be either positive or negative. To this end, we leverage a pretrained TensorFlow Hub module, which should give us better accuracy for our predictions.
First, we upgrade to TensorFlow 2.0 via pip (pip is the package installer for Python). Depending on the packages already installed, some requirements might already be satisfied.
### Upgrade to TensorFlow 2.0
!pip install --upgrade tensorflow
==========================
EXAMPLE OUTPUT
==========================
Collecting tensorflow
Downloading https://files.pythonhosted.org/packages/46/0f/7bd55361168bb32796b360ad15a25de6966c9c1beb58a8e30c01c8279862/tensorflow-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (86.3MB)
|████████████████████████████████| 86.3MB 114kB/s
Collecting tensorboard<2.1.0,>=2.0.0
Downloading https://files.pythonhosted.org/packages/9b/a6/e8ffa4e2ddb216449d34cfcb825ebb38206bee5c4553d69e7bc8bc2c5d64/tensorboard-2.0.0-py3-none-any.whl (3.8MB)
|████████████████████████████████| 3.8MB 41.4MB/s
Collecting tensorflow-estimator<2.1.0,>=2.0.0
Downloading https://files.pythonhosted.org/packages/fc/08/8b927337b7019c374719145d1dceba21a8bb909b93b1ad6f8fb7d22c1ca1/tensorflow_estimator-2.0.1-py2.py3-none-any.whl (449kB)
|████████████████████████████████| 450kB 46.7MB/s
Second, we import all dependencies. Note that these libraries come preinstalled with Colab. If you are using, for example, a Jupyter notebook on your local machine, make sure to install the respective libraries first, e.g. using pip.
### Importing all dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import keras
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split
We then need to upload our IMDB dataset to Colab. To upload data to Colab there are several options. In the following, I will introduce two popular ones: Option 1: Upload from local drive. Option 2: Upload from Google Drive.
Option 1: We can easily upload our IMDB dataset from our local storage. We use the pd.read_csv function to transform the uploaded CSV file into a Pandas dataframe.
from google.colab import files
uploaded = files.upload()
import io
# Select the specific CSV file locally
movie_reviews = pd.read_csv(io.BytesIO(uploaded['MAKE SURE TO PUT YOUR CSV FILE NAME HERE.csv']))
Option 2: We can read the CSV file directly from Google Drive. This option is often faster than the first Option, although it requires authentication to your Google Drive account.
#Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')
------------------------------------------------------------------------------------------
Enter your authorization code: [Your Code]
··········
Mounted at /content/gdrive
------------------------------------------------------------------------------------------
We then store the CSV file in a pandas dataframe for further use. Make sure to input your file path and the "file name + .csv" correctly.
movie_reviews = pd.read_csv('[Your Path] IMDB Dataset.csv')
After we have uploaded our CSV file, we review it in detail. First, we check whether it has been successfully loaded into a pandas dataframe. Then we determine whether any value in the CSV file is missing. Finally, we print the shape of the dataframe and its head (the first five rows) for a quick visual check.
print(type(movie_reviews))
movie_reviews.isnull().values.any()
print(movie_reviews.shape)
print(movie_reviews.head())
We find that movie_reviews is a pandas DataFrame with 50000 rows and two columns (review, sentiment). Moreover, we find that the reviews contain HTML tags as well as single characters that might not be helpful in identifying whether a review has a positive or negative sentiment.
==========================
OUTPUT
==========================
<class 'pandas.core.frame.DataFrame'>
(50000, 2)
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
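The boolean returned by isnull().values.any() only tells us whether something is missing, not where. As a minimal sketch, assuming a small hypothetical dataframe named toy in place of movie_reviews, a per-column count narrows this down:

```python
import pandas as pd

# Hypothetical toy frame standing in for movie_reviews (not the IMDB data)
toy = pd.DataFrame({
    "review": ["Great film", None, "Terrible plot"],
    "sentiment": ["positive", "negative", "negative"],
})

# Single flag: is any cell in the whole frame missing?
has_missing = toy.isnull().values.any()

# Per-column count of missing cells
missing_per_column = toy.isnull().sum()

print(has_missing)
print(missing_per_column)
```

In a notebook, remember to print the flag (or place it on the last line of a cell), otherwise its value is not displayed.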
In addition, it is highly recommended that the data is not heavily skewed towards positive or negative reviews. Hence, we plot the sentiment column of the dataframe using seaborn's countplot function to look at the distribution of both sentiments.
sns.countplot(x='sentiment',
edgecolor=(0,0,0),
linewidth=2,
palette="Dark2",
data=movie_reviews)
We find that we have 25000 positive and 25000 negative reviews in our dataset. Hence, our data is evenly balanced and not skewed.
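The same balance check can be done numerically instead of visually. The following is a small sketch on a hypothetical four-row dataframe (toy is an assumption, not the IMDB data); on the real dataframe, the same two calls would be applied to movie_reviews['sentiment'].

```python
import pandas as pd

# Hypothetical toy frame; replace with movie_reviews for the real check
toy = pd.DataFrame({"sentiment": ["positive", "negative", "positive", "negative"]})

# Absolute number of reviews per class
counts = toy["sentiment"].value_counts()

# Share of each class, useful for spotting skew at a glance
shares = toy["sentiment"].value_counts(normalize=True)

print(counts)
print(shares)
```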
We now preprocess the reviews in our dataframe. In other words, we make sure that the dataframe only contains the words that might indicate a sentiment, removing all other characters that do not help to explain the sentiment.
In particular, we build a function to remove those characters:
def preprocess_text(sen):
    # Removing html tags
    sentence = re.sub(r'<[^>]+>', '', sen)
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence
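To see what the four steps do in practice, here is a short sketch that repeats the function and runs it on one made-up sentence (the sample string is an assumption for illustration, not a row from the dataset):

```python
import re


def preprocess_text(sen):
    sentence = re.sub(r'<[^>]+>', '', sen)                # strip HTML tags
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)         # non-letters become spaces
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)   # drop isolated single letters
    sentence = re.sub(r'\s+', ' ', sentence)              # collapse runs of whitespace
    return sentence


# Made-up sample with tags, a digit, an apostrophe and punctuation
sample = "A wonderful little production. <br /><br />The 2 actors weren't bad!"
cleaned = preprocess_text(sample)
print(repr(cleaned))
```

Two side effects are worth knowing: "weren't" ends up as "weren" because the apostrophe splits off a lone "t" that the single-character step removes, and a single letter at the very start of the text (here "A") survives because the pattern requires whitespace on both sides.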
We now apply the preprocess function we just defined to our review column. To do so, we create an empty list and append each processed review to it.
X = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    X.append(preprocess_text(sen))
We then take the 'sentiment' column of our dataframe and replace the strings with integers via a lambda function (anonymous function). In detail, we replace the 'positive' string with 1 and the 'negative' string with 0. As a recap, an anonymous function in Python is a function without a name, in contrast to normal functions defined with def, which have a name.
y = movie_reviews['sentiment']
y = np.array(list(map(lambda x: 0 if x=="negative" else 1, y)))
We can check whether the transformation worked via two simple print calls.
print(movie_reviews.sentiment.head())
print(y[:5])
Your output should look like this:
==========================
OUTPUT
==========================
0 positive
1 positive
2 positive
3 negative
4 positive
[1 1 1 0 1]
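One caveat of the lambda used here is that it maps every value other than "negative" to 1, so a typo or an unexpected label would silently count as positive. A stricter variant could look like the sketch below; LABEL_TO_INT and encode_labels are hypothetical names introduced for illustration.

```python
import numpy as np

# Hypothetical explicit mapping; anything else is treated as an error
LABEL_TO_INT = {"negative": 0, "positive": 1}


def encode_labels(labels):
    """Map sentiment strings to 0/1 and fail loudly on unexpected values."""
    unknown = sorted(set(labels) - set(LABEL_TO_INT))
    if unknown:
        raise ValueError(f"Unexpected sentiment labels: {unknown}")
    return np.array([LABEL_TO_INT[label] for label in labels])


# Same first five labels as in the output above
encoded = encode_labels(["positive", "positive", "positive", "negative", "positive"])
print(encoded)
```

For the five labels shown in the output, this yields the same array as the lambda version.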
Perfect: that array now resembles the sentiment column (1 = positive, 0 = negative). Now we start with the computational machine learning part. First, we split the data into training data (80%) and test data (20%), passing a fixed random_state as the seed so that the split is reproducible. Because X is a Python list, train_test_split returns X_train and X_test as lists.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=88)
We then convert the newly generated lists into NumPy arrays and check the types.
X_train = np.asarray(X_train)
X_test = np.asarray(X_test)
print(type(X_train))
print(type(y_train))
print(type(X_test))
print(type(y_test))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
We now make use of a TensorFlow Hub module. The module takes a batch of sentences in a 1-D tensor of strings as input and embeds each sentence into a 20-dimensional vector; in other words, it transforms our NumPy array of strings into a tensor of embeddings. The token-based text embedding was trained on the 130GB English Google News corpus. To use it in our model, we wrap the module in a custom layer, a so-called KerasLayer.
# 20-dimensional text embedding
hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
output_shape=[20],
input_shape=[],
dtype=tf.string,
trainable=True)
hub_layer(X_train[[0]])
==========================
OUTPUT
==========================
<tf.Tensor: id=689, shape=(1, 20), dtype=float32, numpy=
array([[ 4.0993705 , -5.788245 , 5.5879493 , -1.0455519 , -6.2843003 ,
-5.5901012 , -2.132286 , 2.0877192 , 4.2431483 , -1.4409944 ,
-2.2836738 , 1.532598 , -1.9618702 , 0.29757962, -7.0124855 ,
3.0051506 , 6.3973274 , -3.191991 , -5.040606 , -2.0053916 ]],
dtype=float32)>
Finally, we build our model. We use a Sequential model with only three layers: our custom hub layer and two dense layers.
# Model
model = Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(200, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.summary()
==========================
OUTPUT
==========================
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
keras_layer (KerasLayer) (None, 20) 400020
_________________________________________________________________
dense (Dense) (None, 200) 4200
_________________________________________________________________
dense_1 (Dense) (None, 1) 201
=================================================================
Total params: 404,421
Trainable params: 404,421
Non-trainable params: 0
_________________________________________________________________
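The parameter counts in the summary can be reproduced by hand. The sketch below recomputes the two dense layers from their shapes; the 400,020 embedding parameters are taken from the summary as-is. Reading them as 20,001 rows of 20 values (for example, a vocabulary plus one out-of-vocabulary bucket) is our interpretation and an assumption, not something the summary states.

```python
embedding_dim = 20    # output size of the hub layer
hidden_units = 200    # size of the first dense layer

# Dense layer: one weight per input/output pair plus one bias per unit
dense_params = embedding_dim * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

# Taken directly from model.summary(); not derived here
embedding_params = 400020
embedding_rows = embedding_params // embedding_dim  # assumed: rows of the embedding table

total_params = embedding_params + dense_params + output_params

print(dense_params, output_params, embedding_rows, total_params)
```

Because the hub layer was created with trainable=True, all of these parameters are updated during training, which is why the summary reports zero non-trainable parameters.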
Before training a model, we need to configure the learning process, which is done via the compile method. It receives three arguments: an optimizer, a loss function, and a list of metrics.
Before we compile our model, we double the learning_rate of the Adam optimizer from its default of 0.001 to 0.002. We now compile our model:
# Pass the optimizer object itself; the string 'adam' would fall back to the default learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=0.002, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['acc', 'mae', 'mse'])
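For intuition about the binary_crossentropy loss, here is a minimal NumPy sketch of the formula, the mean of -(y*log(p) + (1-y)*log(1-p)). It is not the Keras implementation; the function below and its eps clipping value are assumptions made for numerical safety in this illustration.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) can never occur
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


# An undecided model (p = 0.5 everywhere) scores ln(2), about 0.693
undecided = binary_crossentropy([1, 0], [0.5, 0.5])

# Confident and correct predictions score much lower
confident = binary_crossentropy([1, 0], [0.9, 0.1])

print(undecided, confident)
```

This gives a useful reference point for the training log: a loss near 0.69 means the model is barely better than guessing, and lower values mean more confident correct predictions.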
After compiling, we fit (that is, train) our model.
history = model.fit(X_train, y_train, batch_size=750, epochs=8, verbose=1, validation_split=0.20)
==========================
OUTPUT
==========================
Train on 32000 samples, validate on 8000 samples
Epoch 1/8
32000/32000 [==============================] - 4s 140us/sample - loss: 0.6236 - acc: 0.6498 - mae: 0.4366 - mse: 0.2172 - val_loss: 0.5678 - val_acc: 0.7130 - val_mae: 0.4044 - val_mse: 0.1926
Epoch 2/8
32000/32000 [==============================] - 4s 115us/sample - loss: 0.5380 - acc: 0.7344 - mae: 0.3803 - mse: 0.1801 - val_loss: 0.4973 - val_acc: 0.7670 - val_mae: 0.3512 - val_mse: 0.1635
Epoch 3/8
32000/32000 [==============================] - 4s 116us/sample - loss: 0.4583 - acc: 0.7908 - mae: 0.3210 - mse: 0.1482 - val_loss: 0.4188 - val_acc: 0.8149 - val_mae: 0.2920 - val_mse: 0.1334
Epoch 4/8
32000/32000 [==============================] - 4s 117us/sample - loss: 0.3805 - acc: 0.8366 - mae: 0.2631 - mse: 0.1193 - val_loss: 0.3590 - val_acc: 0.8449 - val_mae: 0.2433 - val_mse: 0.1123
Epoch 5/8
32000/32000 [==============================] - 4s 117us/sample - loss: 0.3257 - acc: 0.8639 - mae: 0.2206 - mse: 0.1000 - val_loss: 0.3234 - val_acc: 0.8624 - val_mae: 0.2134 - val_mse: 0.1001
Epoch 6/8
32000/32000 [==============================] - 4s 117us/sample - loss: 0.2820 - acc: 0.8837 - mae: 0.1901 - mse: 0.0851 - val_loss: 0.3003 - val_acc: 0.8721 - val_mae: 0.1937 - val_mse: 0.0923
Epoch 7/8
32000/32000 [==============================] - 4s 116us/sample - loss: 0.2512 - acc: 0.8988 - mae: 0.1686 - mse: 0.0749 - val_loss: 0.2907 - val_acc: 0.8785 - val_mae: 0.1809 - val_mse: 0.0891
Epoch 8/8
32000/32000 [==============================] - 4s 118us/sample - loss: 0.2288 - acc: 0.9090 - mae: 0.1519 - mse: 0.0675 - val_loss: 0.2806 - val_acc: 0.8819 - val_mae: 0.1707 - val_mse: 0.0857
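The sample counts in the first line of the log follow from the two splits applied so far. As a quick arithmetic sketch using the numbers from this article (50000 reviews, test_size=0.20, validation_split=0.20), with round() as a simplification of how the libraries compute the sizes internally:

```python
total_reviews = 50000
test_size = 0.20          # passed to train_test_split
validation_split = 0.20   # passed to model.fit

# First split: hold out the test set
n_test = round(total_reviews * test_size)
n_train_full = total_reviews - n_test

# Second split: model.fit sets aside part of the training data for validation
n_val = round(n_train_full * validation_split)
n_train = n_train_full - n_val

print(n_train, n_val, n_test)
```

So the model effectively trains on 64% of all reviews, validates on 16%, and is tested on the remaining 20%.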
We find that the accuracy on our training data is 0.9090 and the validation accuracy is 0.8819, which is pretty good already. But how does the model perform on our test data?
score = model.evaluate(X_test, y_test, verbose=1)
print("Test Score:", score[0])
print("Test Accuracy:", score[1])
print("Test Mean absolute error:", score[2])
print("Test Mean squared error:", score[3])
==========================
OUTPUT
==========================
Test Score: 0.2876012816429138
Test Accuracy: 0.8795
Test Mean absolute error: 0.17210841
Test Mean squared error: 0.087166086
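To make the three metrics concrete, the sketch below computes accuracy, MAE, and MSE by hand for five made-up predictions. The labels and probabilities are hypothetical, and the 0.5 decision threshold for accuracy is an assumption of this illustration.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, not model output
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.6])

# Accuracy: share of predictions on the correct side of the threshold
accuracy = float(np.mean((y_prob > 0.5).astype(int) == y_true))

# MAE and MSE: how far the probabilities are from the 0/1 labels
mae = float(np.mean(np.abs(y_true - y_prob)))
mse = float(np.mean((y_true - y_prob) ** 2))

print(accuracy, mae, mse)
```

Accuracy only looks at which side of the threshold a prediction falls on, whereas MAE and MSE also reward predictions that are closer to 0 or 1, so they carry information about how confident the model is.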
We find a satisfying accuracy of 0.88 with an MAE of 0.17, an MSE of 0.09, and a loss of 0.29. We now plot the training and validation metrics per epoch to check for overfitting or underfitting.
#Model Accuracy
ax = plt.gca()
ax.set_facecolor('tab:grey')
plt.plot(history.history['acc'], color = 'k')
plt.plot(history.history['val_acc'], linestyle ='--', color='navy')
plt.title('Model Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','val'], loc = 'center')
plt.grid(color='black', linestyle='-', linewidth=1, alpha=0.3)
plt.show()
#Model Loss
ax = plt.gca()
ax.set_facecolor('tab:grey')
plt.plot(history.history['loss'], color = 'k')
plt.plot(history.history['val_loss'], linestyle ='--', color='navy')
plt.title('Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','val'], loc = 'center')
plt.grid(color='black', linestyle='-', linewidth=1, alpha=0.3)
plt.show()
#Model MAE
ax = plt.gca()
ax.set_facecolor('tab:grey')
plt.plot(history.history['mae'], color = 'k')
plt.plot(history.history['val_mae'], linestyle ='--', color='navy')
plt.title('Model MAE')
plt.ylabel('mae')
plt.xlabel('epoch')
plt.legend(['train','val'], loc = 'center')
plt.grid(color='black', linestyle='-', linewidth=1, alpha=0.3)
plt.show()
#Model MSE
ax = plt.gca()
ax.set_facecolor('tab:grey')
plt.plot(history.history['mse'], color = 'k')
plt.plot(history.history['val_mse'], linestyle ='--', color='navy')
plt.title('Model MSE')
plt.ylabel('mse')
plt.xlabel('epoch')
plt.legend(['train','val'], loc = 'center')
plt.grid(color='black', linestyle='-', linewidth=1, alpha=0.3)
plt.show()
All four plots show similar trends for training and validation with respect to accuracy, MAE, MSE, and loss. This is how it should look!
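The same comparison can be made numerically. The sketch below copies the acc and val_acc values from the training log and computes the gap per epoch; reading a growing positive gap as an early hint of overfitting is a rule of thumb, not a strict criterion.

```python
# Values copied from the training log (epochs 1 to 8)
train_acc = [0.6498, 0.7344, 0.7908, 0.8366, 0.8639, 0.8837, 0.8988, 0.9090]
val_acc = [0.7130, 0.7670, 0.8149, 0.8449, 0.8624, 0.8721, 0.8785, 0.8819]

# Positive gap: the model does better on data it was trained on
gaps = [round(t - v, 4) for t, v in zip(train_acc, val_acc)]

# First epoch (1-based) in which training accuracy overtakes validation accuracy
first_epoch_train_ahead = next(i + 1 for i, gap in enumerate(gaps) if gap > 0)

for epoch, gap in enumerate(gaps, start=1):
    print(epoch, gap)
print("training overtakes validation at epoch", first_epoch_train_ahead)
```

Here the training accuracy overtakes the validation accuracy at epoch 5 and the gap widens slowly afterwards, while the validation accuracy still improves in every epoch. That pattern suggests training for many more epochs would mostly benefit the training score.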
We now test our model and let it predict on an obviously positive review. In this regard, we print the prediction as well as the review.
my_array = [["I fits perfectly. These sandals are beautiful and very well made. I am looking forward to warmer weather so I can wear these sandals!"]]
print(my_array[0])
prediction = model.predict(my_array[0])
print(prediction)
Our model predicts that the review has a positive connotation (with a probability of about 94%).
==========================
OUTPUT
==========================
['I fits perfectly. These sandals are beautiful and very well made. I am looking forward to warmer weather so I can wear these sandals!']
[[0.9399543]]
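The model returns a probability rather than a label. To obtain the positive/negative output described at the start, the probability can be compared against a threshold. In the sketch below, to_label is a hypothetical helper and the 0.5 threshold is an assumption; the prediction value is the one printed above.

```python
def to_label(probability, threshold=0.5):
    """Turn a sigmoid output into a sentiment label (threshold is an assumption)."""
    return "positive" if probability >= threshold else "negative"


# Nested list mirroring the shape of the output of model.predict
prediction = [[0.9399543]]
label = to_label(prediction[0][0])
print(label)
```

Raising the threshold makes the classifier more conservative about calling a review positive, which can be useful when false positives are costly.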
We have done it. We have built a powerful ML model that allows us to predict the sentiment of a text with high accuracy.