Text classification is a pretty common application of machine learning. In such an application, machine learning is used to categorise a piece of text into two or more categories. There are both supervised and unsupervised learning models for text classification. In this post, we’ll see how we can use Facebook’s fastText library for some simple text classification.
fastText, developed by Facebook, is a popular library for text classification. The library is an open source project on GitHub, and is pretty active. The library also provides pre-built models for text classification, both supervised and unsupervised. In this post, we’ll check out how we can train the supervised model in the library for some quick text classification. The library can be used as a command line tool, or as a Python package. To keep things simple, we’ll just see a few CLI commands in this post.
Installing fastText for command line is as simple as cloning the Git repo and running the make command inside the directory:
git clone https://github.com/facebookresearch/fastText.git
cd fastText
make
Once you do this, you have the fastText CLI installed, provided you don’t get any errors. You can also install the Python library by running the following command from the same directory:
pip install .
You can verify the installation by running the tool without any arguments from the same directory:
./fasttext
You should see something like this in your terminal:
usage: fasttext <command> <args>

The commands supported by fasttext are:

  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  test-label              print labels with precision and recall scores
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  print-ngrams            print ngrams given a trained model and word
  nn                      query for nearest neighbors
  analogies               query for analogies
  dump                    dump arguments,dictionary,input/output vectors
This indicates that you have the tool installed. The next step is to get our dataset.
Getting the data
The fastText developers have included a dataset for testing out this library, so we’ll use that. It’s a collection of Stack Exchange questions about cooking. The aim here is to classify the questions automatically. Because we’re using supervised learning, every question in the data has to be marked with its categories. Fortunately, the data comes with the categories already marked. So let’s download the data first. The data is available as a compressed file here: https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz. We can download the data and extract it manually, or we can just run the following command to do it from the CLI:
wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
We’ll have a few files in the directory once we unzip it. But our data is in a file named cooking.stackexchange.txt. If you open up the file or head it, you’ll see something like this:
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What’s the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
As you can see, there’s something unusual about the text. Each line starts with one or more words prefixed with __label__. These are the categories of the question. So the way you specify a category in the training data is to include the text __label__ followed by the category name. We can specify multiple categories for a question as well, as can be seen from the sample above. Now that we have the data ready, let’s split it into training and test data.
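To make the label format concrete, here is a minimal Python sketch that separates the labels from the question text in one line of the file. The function name parse_line is my own choice for illustration and is not part of fastText.

```python
LABEL_PREFIX = "__label__"

def parse_line(line):
    """Split one fastText-formatted line into (labels, question text).

    Hypothetical helper for illustration; not part of the fastText library.
    """
    labels = []
    words = []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            # Drop the prefix and keep only the category name.
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)

sample = "__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?"
labels, text = parse_line(sample)
print(labels)  # ['sauce', 'cheese']
print(text)    # How much does potato starch affect a cheese sauce recipe?
```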
Splitting the data into training and test data
Before we begin training our model, we have to split the data so that we have one dataset for training the model and one for testing the accuracy of the model. If you want to know why this is necessary or how to do it in Python, you can read my post about it here. Usually, we do an 80–20 split of the data, 80% of the data for training, and 20% for testing. To do this, we first need to see how many questions we have. There’s one question per line, so getting the number of lines in the file will do it:
$ wc -l cooking.stackexchange.txt
15404 cooking.stackexchange.txt
The count comes out to 15404 lines. 80% of that is 12323.2, so we’ll round up and take the first 12324 lines as the training dataset. The remaining 3080 lines will be our test data. To do this split, we’ll run the following commands:
$ head -n 12324 cooking.stackexchange.txt > training_data.txt
$ tail -n 3080 cooking.stackexchange.txt > testing_data.txt
We’ll now have two new files, one for training and one for testing. Next, we’ll train the model with our training data.
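If you’d rather see the arithmetic behind the split spelled out, here is a rough Python sketch of the same idea. The function split_lines is a made-up helper, and the placeholder lines below only stand in for the real file so that the numbers can be checked.

```python
import math

def split_lines(lines, train_percent=80):
    """Return (training, testing) lists, with the first chunk used for training.

    Hypothetical helper that mirrors the head/tail commands; it rounds the
    cut-off up, just like we did by hand.
    """
    cut = math.ceil(len(lines) * train_percent / 100)
    return lines[:cut], lines[cut:]

# Placeholder lines standing in for the 15404 questions in the real file.
lines = [f"__label__demo question {i}" for i in range(15404)]
training, testing = split_lines(lines)
print(len(training), len(testing))  # 12324 3080
```

Like head and tail, this keeps the original order of the file. If the file happened to be sorted by topic, shuffling the lines before splitting would probably be a good idea.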
Training the model
This is actually a very simple command with this library. We just run the fastText CLI tool with the supervised command and provide the input file (which is our training data file), and a name for the model which will be generated. The command for this looks like this:
./fasttext supervised -input training_data.txt -output cooking_question_classification_model
As you can see, that’s a pretty easy command to understand. The -input option specifies the input file and the -output option specifies the name of the model which will be generated. Once you run this command, you should see output something like this:
$ ./fasttext supervised -input training_data.txt -output cooking_question_classification_model
Read 0M words
Number of words: 14492
Number of labels: 735
Progress: 100.0% words/sec/thread: 47404 lr: 0.000000 avg.loss: 10.243105 ETA: 0h 0m 0s
Our model is now trained, and ready to take some questions for classification. Let’s try that now.
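The training output reports how many distinct words and labels fastText found in the data. As a rough sanity check, we could count them ourselves with a few lines of Python. This is only a sketch: count_vocabulary is a hypothetical helper, it splits on whitespace only, and fastText’s own numbers may differ slightly because it does its own tokenization.

```python
from collections import Counter

def count_vocabulary(lines):
    """Count how often each label and each word appears.

    Hypothetical helper; tokens are split on whitespace only.
    """
    label_counts = Counter()
    word_counts = Counter()
    for line in lines:
        for token in line.split():
            if token.startswith("__label__"):
                label_counts[token] += 1
            else:
                word_counts[token] += 1
    return label_counts, word_counts

sample = [
    "__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?",
    "__label__chocolate American equivalent for British chocolate terms",
    "__label__baking __label__oven __label__convection Fan bake vs bake",
]
label_counts, word_counts = count_vocabulary(sample)
print(len(label_counts))    # 6
print(word_counts["bake"])  # 2
```

To run it on the real data, you would pass in the lines of training_data.txt instead of the three sample lines.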
Testing our model with some questions
When we trained our model in the previous step, the command generated a couple of new files: cooking_question_classification_model.bin and cooking_question_classification_model.vec. The .bin file, or the binary file of the model, is what we’ll use now. We can start testing the model by running the following command:
./fasttext predict cooking_question_classification_model.bin -
As you can see, we use the predict command to tell our model that we’re going to do some predictions now. The dash (-) at the end of the command indicates that we’re going to type the questions in the command line. We can also give the command a file with multiple questions, but we’ll keep that for the next post. Once you run this command, the model will start listening for questions, and you can type in a question and hit the Enter or return key to get a prediction. Let’s try that now:
how to bake a cake
how to cook some rice
Does milk make cakes lighter or tougher?
How to prepare dried tortellini?
What exactly is a chowder?
How bad is it to freeze vegetables in glass for a maximum of 4 weeks?
Each of my questions is answered with the text __label__ followed by the category that the model thinks the question belongs to. As you can already tell, the model didn’t get all the categories right. This is expected, as we haven’t really cleaned our data or tuned the model. We can improve the model by pre-processing our data a bit, to make it cleaner and easier for the model to learn from.
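Just to give an idea of what such cleaning could look like, here is a small Python sketch that lowercases a line and puts spaces around a few punctuation characters, so that "stove?" and "stove" end up as the same word. The function preprocess and the exact set of characters are my own assumptions for illustration, not something fastText requires.

```python
import re

# Punctuation to separate from the neighbouring words (an illustrative choice).
PUNCTUATION = re.compile(r"([.!?,'/()])")

def preprocess(line):
    """Lowercase a line and surround selected punctuation with spaces.

    Hypothetical helper; labels stay intact because __label__ is already
    lowercase and hyphens are not in the punctuation set.
    """
    spaced = PUNCTUATION.sub(r" \1 ", line)
    # Collapse the extra whitespace introduced by the substitution.
    return " ".join(spaced.lower().split())

line = "__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?"
print(preprocess(line))
# __label__cast-iron __label__stove how do i cover up the white spots on my cast iron stove ?
```

The same function would have to be applied to both the training data and any question we later ask the model, otherwise the words would no longer match.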
This example is what you’ll encounter the first time you try out fastText. You can check out the fastText website for more info about the library and this tutorial. Because I wanted to limit this post to the basics, I’m not going to talk about validating the model or pre-processing our data for better results.
Another way we can get better results is by using n-grams and looking at the sentences as sequences of words. Once we start looking at words in this way, we’ll be able to understand a lot more about the patterns in the data we trained on and improve our predictions. I’ll write another post about what n-grams are (because the topic deserves a post of its own) and how we can use probability to understand our data better.
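Until then, here is a tiny Python sketch of what word n-grams look like for one of our questions. The function word_ngrams is a hypothetical helper meant only to show the idea. How fastText uses n-grams internally (it has a -wordNgrams training option) is more involved than this.

```python
def word_ngrams(text, n=2):
    """Return all runs of n consecutive words in the text.

    Hypothetical helper for illustration; with n=2 these are bigrams.
    """
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("how to bake a cake"))
# ['how to', 'to bake', 'bake a', 'a cake']
```

With bigrams, "bake a" and "a cake" become features of their own, so some of the word order that is lost when we look at single words is kept.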