How pyThaiNLP's thai2fit Outperforms Google's BERT: State-of-the-Art Thai Text Classification and Beyond

By Charin

Elevator Pitch

Google’s TPU-trained BERT made headlines when it claimed state-of-the-art text classification results in multiple languages, but not Thai. This is the story of how our ragtag group of open-source coders managed to outperform Google with our very own Thai text classification model, thai2fit.

Description

Transfer learning is arguably one of the most influential concepts to happen to machine learning, so much so that some have called 2012, the year of its first mainstream application to computer vision, the turning point of deep learning: the ImageNet moment. For NLP, that moment came in 2018, when Jeremy Howard and Sebastian Ruder introduced ULMFiT, pretraining a language model on Wikipedia text and then using its learned representations for text classification. Many similar models followed, among them Google’s TPU-trained Bidirectional Encoder Representations from Transformers, or BERT, which quickly claimed state-of-the-art results in many NLP tasks such as classification and question answering. Lesser known than BERT is thai2fit (formerly thai2vec), a ULMFiT model trained on Thai Wikipedia for Thai text classification, released earlier the same year as part of the open-source pyThaiNLP library. To everyone’s surprise, especially its author’s, thai2fit fended off benchmark challenges from corporate-backed models such as Facebook’s fastText and Google’s BERT and has remained the state of the art for Thai text classification.
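For readers who want the recipe in code: below is a minimal sketch of the ULMFiT pipeline using the fastai v1 API, which thai2fit builds on. The data path, file name, and hyperparameters are illustrative assumptions, not thai2fit’s actual training script; thai2fit additionally swaps in a Thai tokenizer and weights pretrained on Thai Wikipedia.

```python
from fastai.text import *  # fastai v1

path = Path('data')  # hypothetical folder containing reviews.csv (label, text)

# Stage 1 is done for you: AWD_LSTM weights pretrained on Wikipedia.
# Stage 2: fine-tune the language model on your target-domain text.
data_lm = TextLMDataBunch.from_csv(path, 'reviews.csv')
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
lm.fit_one_cycle(1, 1e-2)
lm.save_encoder('ft_enc')  # keep the fine-tuned encoder

# Stage 3: train a classifier head on top of the fine-tuned encoder.
data_clas = TextClasDataBunch.from_csv(path, 'reviews.csv', vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder('ft_enc')
clf.fit_one_cycle(1, 1e-2)  # in practice, with gradual unfreezing
```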

pyThaiNLP is a ragtag bunch of open-source contributors, founded by then-18-year-old high schooler Tontan Wannaphong Phatthiyaphaibun, who “just wanted to make a simple chatbot” (verbatim) and ended up creating the most-starred Thai natural language processing repository. At the beginning of 2018, I joined the project out of necessity, hoping simply to create a word2vec equivalent for the Thai language for my day job. A few months later, I had ended up adapting ULMFiT to Thai and creating a text classification benchmark on the only large-scale, public text classification dataset available, the wongnai-corpus.
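That “word2vec equivalent” is what pyThaiNLP ships as the thai2fit word vectors. As a rough sketch of how they can be queried through pyThaiNLP’s word_vector module and gensim’s KeyedVectors interface (the exact function names vary between library versions, so treat this as an assumption to check against the docs):

```python
from pythainlp.word_vector import get_model  # loads the thai2fit word vectors

model = get_model()  # a gensim KeyedVectors object
print(model.most_similar("กิน", topn=5))  # nearest neighbors of "eat"
print(model.similarity("แมว", "หมา"))     # similarity of "cat" and "dog"
```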

This is the journey from the days when everyone’s only comment about Thai NLP was that it is “impossible to do” because “we cannot cut Thai words,” to the proof that even the simplest word segmentation model, one achieving only 70% accuracy, can give you text classification performance that rivals world-class models. But there is more: text classification is not the goal; it is the beginning. We will explore together what state-of-the-art transfer learning algorithms in NLP can do, from writing human-readable essays based on a few topics, similar to OpenAI’s GPT-2, to conversational chatbots, to generating financial and medical reports from charts and tables.
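To make “cutting Thai words” concrete: Thai is written without spaces between words, so tokenization is itself a modeling problem. A minimal example with pyThaiNLP’s default dictionary-based tokenizer (the “newmm” engine; the sample sentence and its segmentation are illustrative):

```python
from pythainlp.tokenize import word_tokenize

# "แมวกินปลา" = "(the) cat eats (a) fish", written with no spaces
print(word_tokenize("แมวกินปลา", engine="newmm"))
# expected: ['แมว', 'กิน', 'ปลา']  -- cat / eat / fish
```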

The ImageNet moment for NLP is upon us, and at pyThaiNLP, it is our number one priority not to let the Thai language get left behind.

Notes

I am one of the main contributors to pyThaiNLP and the author of thai2fit. Our claim to the state-of-the-art result is based on the wongnai-corpus, which is, to the best of our knowledge, the only current large-scale benchmark for Thai text classification. We are planning to expand the benchmarks to other datasets and might be able to get this done before the event (see progress here). My credentials can be verified both through the sources cited and by references, namely the maintainers of the wongnai-corpus and Ekapol Chuangsuwanich, a professor of Computer Engineering at Chulalongkorn University who helped create the corpus.