My First Kaggle Competition: Titanic

For more practical experience with gradient descent, I decided to participate in the Titanic Competition. Here is how I did it and what I learned.

I took the following approach:

Setting up my local machine for the Kaggle competition
Trying to write as little code as possible, I implemented a notebook that uses a fast.ai tabular learner
Creating another notebook re-implementing Jeremy’s Excel-based model

Note: This blog post itself is a notebook, and it can be found here on GitHub.

Installing Kaggle

Getting ready for the Kaggle competition requires registering for the competition (a few clicks on the kaggle website), and installing kaggle on your local machine. The following is based on the Live-Coding Session 7 and the related official topic in the forums.

The first step is to install kaggle:

pip install --user kaggle

As a result, the following warning is displayed: The script kaggle is installed in '/home/<your user>/.local/bin' which is not on PATH. This means that you need to add this directory to the PATH variable. This is done by adding the following line to the .bashrc file and restarting the terminal:

PATH=~/.local/bin:$PATH

Note: To display the current PATH-variable use: echo $PATH

As a result, typing the kaggle-command on the command line works, but the next error shows up (as expected): OSError: Could not find kaggle.json. Make sure it's located in /home/chrwittm/.kaggle. Or use the environment method.

This means that you cannot yet authenticate against the kaggle platform. To solve this, download your personal kaggle.json: On the kaggle website, navigate to “Account” and click on “Create New API Token”. As a result, the kaggle.json is downloaded.

Copy the kaggle.json-file into the .kaggle-directory in your home directory.

Typing the kaggle-command on the command line gives you the final clue as to what is missing: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/chrwittm/.kaggle/kaggle.json'

Therefore, type:

chmod 600 /home/<your user>/.kaggle/kaggle.json

Typing the kaggle-command on the command line again confirms: We are in business :)
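The two manual steps above (copying the token and restricting its permissions) can also be scripted. The following is a minimal sketch, not part of the original notebook: the function name install_kaggle_token is hypothetical, and the location of the downloaded file is an assumption that depends on your browser.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(downloaded: Path, home: Path) -> Path:
    """Copy kaggle.json into <home>/.kaggle and make it readable by the owner only."""
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(downloaded, target)
    # Same effect as `chmod 600`: read/write for the owner, nothing for anyone else
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

A call could look like install_kaggle_token(Path.home() / "Downloads" / "kaggle.json", Path.home()), assuming the browser saved the token to the Downloads folder.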

Downloading the dataset

To download the dataset, run the following command (which you can also find on the kaggle website):

kaggle competitions download -c titanic

As a result, the file titanic.zip is downloaded.

To unzip, type:

unzip titanic.zip

When I did this for the first time, it resulted in an error: /bin/bash: unzip: command not found

To install zip and unzip, type:

sudo apt-get install zip
sudo apt-get install unzip

As a result, unzipping works, and we have a dataset to work with :).
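If installing system packages is not an option, the archive can also be extracted with Python's built-in zipfile module. This is a minimal sketch rather than what I actually did; the helper name extract_archive is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(archive: Path, dest: Path) -> list[str]:
    """Extract every member of a zip archive into dest and return the member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

Calling extract_archive(Path("titanic.zip"), Path(".")) would place the extracted files in the current directory, just like the unzip command.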

import pandas as pd

train = pd.read_csv("train.csv")
train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Implementing a Fast.ai Tabular Learner

The goal was not to create a perfect submission, but to simply train a model as fast as possible to

get a baseline
get to know how a kaggle competition works (remember, this is my first one)

Therefore, I created dataloaders as shown in lesson 1 or in the docs, sorting the variables into categorical and continuous ones and excluding the irrelevant ones.

Note 1: In this blog post, I am presenting the steps in a fast-forward way; here is the original notebook.

Note 2: When writing this up, I was not able to reproduce exactly the same results, but basically this is how the story went.

from fastai.tabular.all import *

path = "."

dls = TabularDataLoaders.from_csv('train.csv', path=path, y_names="Survived",
    cat_names = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'],
    cont_names = ['Age', 'Fare'],
    procs = [Categorify, FillMissing, Normalize])

Now we can train a model:

learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(10)  # number of epochs; change this value for more/less training

epoch	train_loss	valid_loss	accuracy	time
0	0.548652	0.315984	0.640449	00:00
1	0.454461	0.325496	0.640449	00:00
2	0.373511	0.289948	0.640449	00:00
3	0.319270	0.251090	0.640449	00:00
4	0.280473	0.196879	0.640449	00:00
5	0.249269	0.173640	0.640449	00:00
6	0.225535	0.152192	0.640449	00:00
7	0.207350	0.141283	0.640449	00:00
8	0.192223	0.137462	0.640449	00:00
9	0.180697	0.137344	0.640449	00:00

With this learner, we can make the predictions on the test-dataset.

test = pd.read_csv("test.csv")

# replacing null values with 0
test['Fare'] = test['Fare'].fillna(0)

# create Predictions as suggested here:
# https://forums.fast.ai/t/tabular-learner-prediction-using-data-frame/90534/2
test_dl = learn.dls.test_dl(test)
preds, _ = learn.get_preds(dl=test_dl)

test['Survived_pred'] = preds.squeeze()
test.head()

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Survived_pred
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q	0.064765
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S	0.454887
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q	-0.025921
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S	-0.015690
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S	0.508172

Interpreting the values in column Survived_pred is important, because we need to turn these values into 0 and 1 for the submission. The submission file should only have the columns PassengerId and Survived. For the first submission, I did not worry about it too much and simply picked a threshold of 0.5 (let’s come back to that a little later).

threshold = 0.5  # predictions above this value are counted as "survived"
test['Survived'] = [ 1 if element > threshold else 0 for element in preds.squeeze()]

submission1 = test[['PassengerId', 'Survived']]
submission1.to_csv('submission1.csv', index=False)
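Before uploading, it can be worth checking that the file really has the expected shape. The following is a minimal sketch using only the standard library; the function name check_submission is hypothetical, and the check covers only the two rules mentioned above (the two column names and 0/1 labels).

```python
import csv
from pathlib import Path


def check_submission(path: Path) -> int:
    """Validate the header and the 0/1 labels of a submission file; return the row count."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["PassengerId", "Survived"]:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        rows = 0
        for row in reader:
            # Labels are compared as text because csv yields strings
            if row["Survived"] not in {"0", "1"}:
                raise ValueError(
                    f"invalid label for passenger {row['PassengerId']}: {row['Survived']}"
                )
            rows += 1
    return rows
```

For example, check_submission(Path("submission1.csv")) returns the number of data rows, or raises a ValueError if a raw prediction such as 0.45 slipped into the Survived column.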

I uploaded the results, and they were better than random ;) - Score 0.73923

The score is not great, but the whole point was to get a baseline as quickly as possible, and to “play the whole kaggle game”. Actually, the fact that I produced this result in about 1-2 hours felt pretty good :).

Note: Running this notebook, I got a score of 0.75119. I am not sure what caused the difference… but better is always good ;)

So how can we improve the score? More training, interpreting the results differently? As it turns out: Both.

Let’s look at the distribution of Survived_pred:

test.Survived_pred.hist();

As it turned out, setting my threshold to 0.6 produced a better result: Score: 0.74162. (I could not reproduce this with this notebook while writing up the blog post.)

More training also produced better results: running for 50 epochs resulted in a lower loss and a better score. Training for 50 epochs with a threshold of 0.7 gave this result: Score: 0.76794 (with this notebook 0.77033)

So there is some randomness when training, and it is important to properly interpret the results. Getting about 77% right with this simple approach is not too bad.
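Instead of comparing thresholds via leaderboard scores, the candidates could also be compared on held-out rows whose labels are known. The following is a minimal sketch of that idea, not what I did in the notebook: the function names are hypothetical and the scores and targets are made-up toy values. In fastai, learn.get_preds() without arguments returns predictions and targets for the validation set, which could supply real values.

```python
def to_labels(scores, threshold):
    """Map continuous scores to 0/1 labels: 1 if the score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]


def accuracy(labels, targets):
    """Fraction of positions where predicted label and target agree."""
    return sum(l == t for l, t in zip(labels, targets)) / len(targets)


def best_threshold(scores, targets, candidates):
    """Return (threshold, accuracy) of the most accurate candidate.

    On ties, the earliest candidate wins.
    """
    results = ((t, accuracy(to_labels(scores, t), targets)) for t in candidates)
    return max(results, key=lambda pair: pair[1])


# Made-up toy values, only to show the mechanics
scores = [0.06, 0.45, -0.03, 0.55, 0.65, 0.9]
targets = [0, 0, 0, 0, 1, 1]
print(best_threshold(scores, targets, [0.4, 0.5, 0.6, 0.7]))
```

With only a few held-out rows the chosen threshold is itself noisy, so this is a complement to the leaderboard score rather than a replacement.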

Re-Implementing the Excel Model

After the quick win with Fast.AI, I decided to re-implement what Jeremy did in Excel in video lecture 3 to predict the survivors. Let’s see how it performs against the Fast.AI tabular learner.

Since that involved quite a bit of code, let me simply link to the notebook and discuss the learnings / results.

As it turned out:

I had to do a bit of data cleansing.
The feature engineering took some time, which taught me some general Python lessons.
Implementing the optimizer was a nice exercise, revisiting gradient descent and matrix multiplication, and doing some hands-on work with tensors.
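To illustrate the last point, here is a minimal sketch of gradient descent on a single linear layer, written with plain Python lists instead of tensors. It is not the code from my notebook: the function names, the learning rate, the number of steps, and the toy data are all assumptions chosen for illustration.

```python
def matvec(X, w):
    """Multiply a matrix (list of rows) by a weight vector."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in X]


def mse(preds, y):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def gradient(X, w, y):
    """Gradient of the mean squared error with respect to the weights."""
    errors = [p - t for p, t in zip(matvec(X, w), y)]
    n = len(y)
    return [2 / n * sum(e * row[j] for e, row in zip(errors, X)) for j in range(len(w))]


def fit(X, y, lr=0.1, steps=500):
    """Plain gradient descent, starting from zero weights."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        # Step against the gradient to reduce the loss
        w = [wi - lr * g for wi, g in zip(w, gradient(X, w, y))]
    return w


# Toy data: a constant 1 (bias term) plus one made-up binary feature
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0.0, 1.0, 0.0, 1.0]
w = fit(X, y)
print(w, mse(matvec(X, w), y))
```

On this toy data the target equals the second feature, so the weights should approach [0, 1] and the loss should approach zero. Adding more layers means chaining several such matrix multiplications with a non-linearity in between.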

The first model with just one layer scored 0.75837, which is even better than my Fast.AI baseline, but not quite as good as the optimized version.

The next iteration with 2 and 3 layers scored better:

Score: 0.77033 (2-layers)
Score: 0.77272 (3-layers)

This was quite surprising: The self-written algorithm is better than the Fast.AI one. Any ideas why that would be?

Nonetheless, it seems to hit a ceiling at 77%, and it would make sense to dive deeper into tabular data, but that is for another time. My goal was not to optimize the competition result, but to participate in my first kaggle competition, and to re-visit the topic of gradient descent and matrix multiplication. I will most likely return to this dataset/challenge in the future.

Reuse

CC BY 4.0