Re-working Fast.AI lesson 4, I transferred the approach from Jeremy’s notebook “Getting started with NLP for absolute beginners” to the Kaggle competition “Natural Language Processing with Disaster Tweets”.
When I started this project, I did not expect it to become such an extended endeavor. It introduced me to many different aspects of natural language processing in particular and machine learning in general. To share what I learned with the community, I wrote up my approach and the key takeaways in this blog post.
In the spirit of producing results quickly and training models early in the development process:
The key learnings:
More details are in my blog post.
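To illustrate the "results quickly, models early" spirit, here is a minimal sketch of a first baseline for the disaster-tweets task. This is an assumption for illustration, not the notebook's actual approach (which followed Jeremy's transformer-based workflow): a TF-IDF plus logistic-regression pipeline that gets a first score on the board in a few lines. The tiny in-line dataset is a stand-in; the real competition supplies a train.csv with "text" and "target" (1 = real disaster) columns.

```python
# Hedged sketch: a quick-and-dirty baseline, not the blog post's method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in data; replace with the competition's train.csv columns.
texts = [
    "Forest fire near La Ronge Sask. Canada",
    "Just finished a great workout at the gym",
    "Heavy flooding reported downtown, roads closed",
    "I love this new pizza place",
]
targets = [1, 0, 1, 0]  # 1 = real disaster tweet

# Fit bag-of-words features and a linear classifier in one pipeline.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, targets)

# Predict on an unseen tweet; output is a 0/1 class label.
pred = baseline.predict(["Earthquake damage reported in the city"])
print(pred[0])
```

A baseline like this also gives an early sanity check that the data loading and submission pipeline work end to end before investing in larger models.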
Just when I thought I was done with disaster tweets, I realized I had forgotten a topic I wanted to cover. In a new notebook version, I implemented a confusion matrix to find tweets that are incorrectly labeled in the training set - basically the same approach as looking at top losses (as in lesson 2, for example).
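The idea behind that approach can be sketched as follows. This is a hedged, library-agnostic version (the notebook itself would use fastai's `ClassificationInterpretation` and `plot_top_losses`); the function name `top_loss_indices` and the toy numbers are mine: rank training examples by their per-example loss, because the examples the model is most confidently wrong about are prime candidates for labeling errors.

```python
# Hedged sketch of "top losses" mislabel hunting, assuming binary labels
# and predicted probabilities for the positive class.
import numpy as np

def top_loss_indices(probs, labels, k=5):
    """Return indices of the k examples the model is most confidently
    wrong about - candidates for incorrect labels.

    probs  : predicted probability of the positive class, shape (n,)
    labels : ground-truth 0/1 labels, shape (n,)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    eps = 1e-12  # avoid log(0)
    # per-example binary cross-entropy loss
    losses = -(labels * np.log(probs + eps)
               + (1 - labels) * np.log(1 - probs + eps))
    # highest-loss examples first
    return np.argsort(losses)[::-1][:k]

# Toy usage: example 2 has label 1 but near-zero predicted probability,
# so it surfaces first as a candidate for a labeling error.
probs = [0.9, 0.2, 0.05, 0.85]
labels = [1, 0, 1, 1]
print(top_loss_indices(probs, labels, k=2))  # prints [2 1]
```

Inspecting the off-diagonal cells of a confusion matrix serves the same purpose: each misclassified example there is either a model mistake or a labeling mistake, and reading the actual tweets tells you which.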
I was indeed successful in finding quite a few incorrectly labeled tweets, but surprisingly this did not improve my overall competition result - as far as I understand, this is a limitation of the dataset. I summarized the full story and my learnings in this blog post.
Summing it all up, I also wrote this Kaggle discussion forum post.