How to do a Kaggle Competition?

by CM


Posted on December 06, 2019



The Goal:

Kaggle is an online community of data scientists and machine learning practitioners that allows users to find and publish data sets, explore and build models in a web-based data-science environment, and enter competitions to tackle data science challenges. In short, Kaggle provides ML competitions, public data sets, a cloud-based workbench for data science, and short-form AI education. In this article, we explore Kaggle by participating in our first competition: the Kannada MNIST competition. This competition provides an MNIST-like dataset of Kannada handwritten digits. In other words, the goal of this competition is a simple extension of the classic MNIST challenge we're all familiar with: instead of Arabic numerals, it uses a recently released dataset of Kannada digits.


Key components are:

Dataset:
>> The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine, in the Kannada script. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive. The training data set, train.csv, has 785 columns. The first column, called label, is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.
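
To get a feel for this layout, here is a minimal sketch (using the same ../input/Kannada-MNIST path we will use below) that reshapes the first training row back into a 28x28 image and displays it:

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("../input/Kannada-MNIST/train.csv")

#First column is the label, the remaining 784 columns are the pixel values
label = train.iloc[0, 0]
pixels = train.iloc[0, 1:].to_numpy().reshape(28, 28)

plt.imshow(pixels, cmap='gray_r')  #higher pixel values render darker
plt.title('Label: ' + str(label))
plt.show()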

Note that this article only covers the bare minimum of entering a Kaggle competition; we will just walk through the basic but necessary steps. First, go to Kaggle and look for competitions. In the search field, enter "Kannada MNIST". You will be directed to the Playground Code Competition -- Kannada MNIST.



On the landing page, we need to have a look at five things:

  • (1) Description
  • (2) Evaluation
  • (3) The Timeline
  • (4) Kernel Requirements
  • (5) Data


(1) The DESCRIPTION usually tells us what the challenge is about, i.e. what we will later predict using the provided data. In other words, the task and useful background information can be found in this section. (2) EVALUATION tells us how the competition is scored, e.g. by the accuracy of our predictions (the percentage of images we get correct). Here we also find the exact format our final submission file needs to have in order to be accepted. As the name suggests, the (3) TIMELINE lists (a) the entry deadline, (b) the team merger deadline, and (c) the final submission deadline. (4) KERNEL REQUIREMENTS describes the constraints of the platform on which we run our Jupyter notebooks in the browser.

Under (5) Data, we usually find the training and test data that we can use for building our model.
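
The data tab is also the quickest way to learn the required output format, since most Kaggle competitions ship a sample submission file. A minimal sketch, assuming this competition provides a sample_submission.csv as is typical:

import pandas as pd

sample = pd.read_csv("../input/Kannada-MNIST/sample_submission.csv")
print(sample.head())  #expected columns: id, label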



In this competition, we will try to correctly classify images of the digits 0-9. So let's start.

First, we import all our dependencies:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Second, we load the training and test data into pandas dataframes using the pd.read_csv function. For this, we need to check which file format the training and test data come in -- in our competition, the data is available as CSV files. We find the data under the path "../input/Kannada-MNIST".

training = pd.read_csv("../input/Kannada-MNIST/train.csv")
test = pd.read_csv("../input/Kannada-MNIST/test.csv")
print(test.head())

When printing test.head(), we see the first five rows of our test data. As mentioned on the data tab, we have 784 columns of pixel data plus one "id" column.

==========================
OUTPUT
==========================

id  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  \
0   0       0       0       0       0       0       0       0       0       0
1   1       0       0       0       0       0       0       0       0       0
2   2       0       0       0       0       0       0       0       0       0
3   3       0       0       0       0       0       0       0       0       0
4   4       0       0       0       0       0       0       0       0       0

...  pixel774  pixel775  pixel776  pixel777  pixel778  pixel779  pixel780  \
0  ...         0         0         0         0         0         0         0
1  ...         0         0         0         0         0         0         0
2  ...         0         0         0         0         0         0         0
3  ...         0         0         0         0         0         0         0
4  ...         0         0         0         0         0         0         0

pixel781  pixel782  pixel783
0         0         0         0
1         0         0         0
2         0         0         0
3         0         0         0
4         0         0         0

[5 rows x 785 columns]
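
Before slicing, it is worth a quick sanity check that the shapes match the data tab's description (the 48000/12000 train/validation split further below implies 60000 training rows):

print(training.shape)  #expected (60000, 785): label column + 784 pixel columns
print(test.shape)      #id column + 784 pixel columns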

With respect to data processing, we need to transform the dataframes into NumPy arrays. We do this because we can then easily slice the arrays into the structure we want: the dependent variable (y values) and our input features (x values). Using the newly built arrays, we can easily apply a tensor transformation and use them in our Keras model later on.

#Dataframe to array
training_array = np.array(training)

#Slicing arrays
x_train = training_array[:,1:]
y_train = training_array[:,0]

#Transforming to tensor
x_train = tf.constant(x_train)
y_train = tf.constant(y_train)

#Normalizing each image's pixel values (L2 normalization along axis 1)
x_train = tf.keras.utils.normalize(x_train, axis=1)

#We repeat these steps for the test data; note that the first column of test.csv is the id, not a label
test_array = np.array(test)
x_test = test_array[:,1:]
y_test = test_array[:,0]

x_test = tf.constant(x_test)
x_test = tf.keras.utils.normalize(x_test, axis=1)
y_test = tf.constant(y_test)
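
A common, simpler alternative to the L2 normalization above is min-max scaling, i.e. dividing the raw pixel values by 255 so that they fall between 0 and 1. A minimal sketch using the same arrays:

#Alternative: scale raw pixel values (0-255) into the 0-1 range
x_train_scaled = tf.constant(training_array[:, 1:], dtype=tf.float32) / 255.0
x_test_scaled = tf.constant(test_array[:, 1:], dtype=tf.float32) / 255.0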

We then build the most straightforward model (Sequential). We treat the 784 pixels as individual features, ignoring the fact that they actually represent an image. In other words, for the sake of simplicity we do not use a convolutional neural network here; we stick to a Sequential model with three dense layers. In this step, we build our model, compile it, and then train it.

For the model, we use one input, two hidden, and one output layer. We start off with 80 neurons and an input shape of 784 (as we have 784 pixels). We use ReLU activation for both the first and second hidden layer. The second hidden layer has 120 neurons that receive their input from the previous layer (80 neurons). The last (output) layer has 10 neurons (one for each digit 0-9). We use softmax as the activation function here, which gives us a probability for each output neuron individually.

In the compiling step, we use sparse_categorical_crossentropy as our loss function because our labels come as plain integers (0-9) rather than one-hot encoded vectors. Further, we use Adam as the optimizer and accuracy for evaluating the performance of our model.
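
To make the label-format distinction concrete, here is a minimal sketch (with made-up labels and dummy uniform predictions) showing that the sparse and one-hot variants compute the same loss, only from differently encoded targets:

import numpy as np
import tensorflow as tf

y_sparse = np.array([2, 0, 9])  #integer labels, as found in train.csv
y_onehot = tf.keras.utils.to_categorical(y_sparse, num_classes=10)

probs = np.full((3, 10), 0.1, dtype=np.float32)  #dummy uniform predictions

#Both print the same loss values (-log(0.1) each); only the label format differs
print(tf.keras.losses.sparse_categorical_crossentropy(y_sparse, probs).numpy())
print(tf.keras.losses.categorical_crossentropy(y_onehot, probs).numpy())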

We then train our model with a batch size of 100 for 9 epochs, using 20% of the training data for validation.

model = Sequential()

model.add(Dense(units = 80, input_shape = (784,), activation='relu'))
model.add(Dense(units = 120, activation='relu'))
model.add(Dense(units = 10, activation='softmax'))
model.summary()

#Compiling
model.compile(loss = 'sparse_categorical_crossentropy', optimizer ='Adam', metrics=['acc'])

#Training
history = model.fit(x_train, y_train, verbose=1, epochs=9, batch_size = 100, validation_split = 0.2)

We find that our simple model reaches an accuracy of 0.9901 on the training set and a val_acc of 0.9766 on the validation data. This is quite good considering that we did not use a convolutional neural network but a very simple model instead.

==========================
OUTPUT
==========================

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 80)                62800
_________________________________________________________________
dense_1 (Dense)              (None, 120)               9720
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1210
=================================================================
Total params: 73,730
Trainable params: 73,730
Non-trainable params: 0
_________________________________________________________________
Train on 48000 samples, validate on 12000 samples
Epoch 1/9
48000/48000 [==============================] - 3s 73us/sample - loss: 0.4019 - acc: 0.9025 - val_loss: 0.1072 - val_acc: 0.9664
Epoch 2/9
48000/48000 [==============================] - 2s 49us/sample - loss: 0.1151 - acc: 0.9657 - val_loss: 0.0813 - val_acc: 0.9734
Epoch 3/9
48000/48000 [==============================] - 2s 48us/sample - loss: 0.0909 - acc: 0.9729 - val_loss: 0.0791 - val_acc: 0.9739
Epoch 4/9
48000/48000 [==============================] - 2s 44us/sample - loss: 0.0745 - acc: 0.9768 - val_loss: 0.0610 - val_acc: 0.9791
Epoch 5/9
48000/48000 [==============================] - 2s 44us/sample - loss: 0.0626 - acc: 0.9812 - val_loss: 0.0614 - val_acc: 0.9794
Epoch 6/9
48000/48000 [==============================] - 2s 44us/sample - loss: 0.0526 - acc: 0.9844 - val_loss: 0.0698 - val_acc: 0.9777
Epoch 7/9
48000/48000 [==============================] - 2s 44us/sample - loss: 0.0448 - acc: 0.9870 - val_loss: 0.0717 - val_acc: 0.9775
Epoch 8/9
48000/48000 [==============================] - 2s 43us/sample - loss: 0.0382 - acc: 0.9887 - val_loss: 0.0660 - val_acc: 0.9809
Epoch 9/9
48000/48000 [==============================] - 2s 43us/sample - loss: 0.0326 - acc: 0.9901 - val_loss: 0.0735 - val_acc: 0.9766
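
Since model.fit returns a History object, we can also plot these curves with the matplotlib import from earlier; a minimal sketch:

#Plot training vs. validation accuracy per epoch
plt.plot(history.history['acc'], label='train acc')
plt.plot(history.history['val_acc'], label='val acc')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()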

For this competition, our submission file needs a specific format: the first column should be called 'id', whereas our prediction column should be called 'label'. Hence, in the last step, we make predictions over the test data, rename the respective columns, and prepare the file we will later submit to the competition.

test_pred = pd.DataFrame(model.predict(x_test))
test_pred = pd.DataFrame(test_pred.idxmax(axis = 1))
test_pred.index.name = 'id'
test_pred = test_pred.rename(columns = {0: 'label'}).reset_index()

sub = test_pred

print(sub.head())

Printing sub.head(), we can verify that the first five predictions are in the required id/label format.

==========================
OUTPUT
==========================

id  label
0   0      3
1   1      0
2   2      2
3   3      7
4   4      7

In the very last step, we push our submission to the output path in Kaggle.

sub.to_csv('submission.csv',index=False)
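
As a quick sanity check before committing, we can read the file back and confirm it has exactly the two required columns and one row per test image:

check = pd.read_csv('submission.csv')
print(check.shape)          #should be (number of test rows, 2)
print(list(check.columns))  #should be ['id', 'label']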

After having done this, we can commit our notebook via the commit button -- DONE!

Committing might take a couple of minutes.

When it has finished committing, we can immediately see our standing on the leaderboard.

We have done it. We have built a simple yet capable ML model that predicts the digit shown in a 28x28 image. Now it's time to check out the leaderboard ;)

Leverage TensorFlow and Keras!

#EpicML

