Stemming is one of the most common data pre-processing operations in almost all Natural Language Processing (NLP) projects. If you’re new to this space, you may have come across the word without knowing exactly what it means. You might also confuse stemming with lemmatization, which is a similar operation. In this post, we’ll see what exactly stemming is, with a few examples here and there. I hope I’ll be able to explain the process in simple words for you.
To put it simply, stemming is the process of removing a part of a word to reduce it to its stem. The stem is not necessarily the word’s dictionary root; it is whatever is left after an algorithm applies its rules for chopping off word endings. This is, for the most part, how stemming differs from lemmatization, which reduces a word to its dictionary root and is more complex because it needs a much deeper knowledge of the language. We’ll talk about lemmatization in another post, maybe. For this post, we’ll stick to stemming and see a few examples.
Let’s assume we have a set of words: send, sent and sending. All three are forms of the same root word, send. Ideally, after we stem them, we’d have just the one word, send. (In practice, a stemmer that only strips suffixes will handle sending but may leave an irregular form like sent as it is.) Similarly, if we have the words ask, asking and asked, we can apply a stemming algorithm to get the root word ask. Stemming is as simple as that. But (there’s always a but), unfortunately, it’s not always that simple. We will sometimes run into complications, and these are called over stemming and under stemming. Let’s see more about them in the next sections.
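To make the idea of “chopping a word off” concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for this post, not a real algorithm such as the ones shipped in NLP libraries; the function name `toy_stem` and the suffix list are my own assumptions.

```python
# Toy suffix-stripping stemmer (illustrative only, not a real stemming algorithm).
# The suffix list and the minimum stem length are assumptions made for this sketch.
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LENGTH = 3


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, as long as enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word  # no rule applied, so the word is returned unchanged


for w in ["ask", "asking", "asked", "send", "sending", "sent"]:
    print(w, "->", toy_stem(w))
```

With these rules, ask, asking and asked all come out as ask, and sending comes out as send. The irregular form sent is left untouched because no suffix rule applies to it; mapping it to send would need something beyond suffix stripping, such as a lookup of irregular forms.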
Over stemming happens when a much larger part of a word is chopped off than required. As a result, two or more words with different meanings are incorrectly reduced to the same stem when they should have been reduced to different stems. Take university and universe, for example. Some stemming algorithms may reduce both words to the stem univers, which would imply that both words mean the same thing, and that is clearly wrong. So we have to be careful when we select a stemming algorithm, and when we try to optimize it. As you can imagine, under stemming is the opposite of this.
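One simple way to spot possible over stemming is to group words by their stem and look at which unrelated words end up together. The sketch below does this with a deliberately aggressive toy stemmer; the suffix rules and the names `aggressive_stem` and `group_by_stem` are hypothetical and chosen only to reproduce the university/universe collision.

```python
from collections import defaultdict

# Hypothetical, deliberately aggressive suffix rules (not taken from a real stemmer).
AGGRESSIVE_SUFFIXES = ("ity", "al", "e")
MIN_STEM_LENGTH = 3


def aggressive_stem(word: str) -> str:
    """Strip the first matching suffix if at least MIN_STEM_LENGTH characters remain."""
    word = word.lower()
    for suffix in AGGRESSIVE_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of input words that were reduced to it."""
    groups = defaultdict(list)
    for w in words:
        groups[aggressive_stem(w)].append(w)
    return dict(groups)


groups = group_by_stem(["university", "universe", "send"])
for stem, members in groups.items():
    note = "  <- check for over stemming" if len(members) > 1 else ""
    print(stem, members, note)
```

Here university and universe both land under univers, so the group gets flagged. A flagged group is only a hint: whether the words really have different meanings is still something a person (or a labeled word list) has to decide.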
In under stemming, two or more words are wrongly reduced to different stems when they should actually be reduced to the same one. For example, consider the words “data” and “datum.” Some algorithms may reduce these words to dat and datu respectively, which is obviously wrong, since both should be reduced to the same stem, dat. But trying to optimize an algorithm to fix such cases might in turn lead to over stemming as well. So we have to be very careful when we’re dealing with stemming.
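The trade-off between the two errors can be sketched with two toy rule sets. Everything here is an assumption made for illustration: the helper `make_stemmer` and both suffix lists are hypothetical, picked so that the first one reproduces the dat/datu split and the second one fixes it at a cost.

```python
def make_stemmer(suffixes, min_stem_length=3):
    """Build a toy stemmer that strips the first matching suffix from a word."""
    def stem(word: str) -> str:
        word = word.lower()
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
                return word[: -len(suffix)]
        return word
    return stem


# Hypothetical rule sets, not taken from any real stemming algorithm.
narrow = make_stemmer(("a", "m"))   # strips a single trailing letter
broad = make_stemmer(("um", "a"))   # adds a rule for the "um" ending

related = ["data", "datum"]         # words that should share one stem
narrow_stems = {narrow(w) for w in related}
broad_stems = {broad(w) for w in related}

print("narrow rules:", sorted(narrow_stems))  # more than one stem: under stemming
print("broad rules:", sorted(broad_stems))    # a single shared stem

# The extra rule has a side effect on unrelated words.
print("forum ->", broad("forum"), "| for ->", broad("for"))
```

With the narrow rules, data and datum end up as two different stems, which is under stemming. The broad rules bring both to dat, but the same “um” rule also turns forum into for, which now collides with the unrelated word for. That is over stemming creeping in as a result of the fix.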
I hope this was helpful in understanding what stemming is and the two kinds of errors it can produce. If there’s still any confusion about this, please let me know in the comments below and I’ll try to clear up any doubts you have.
And if you like what you see here, or on my Medium blog, and would like to see more of such helpful technical posts in the future, consider supporting me on Patreon, where I have some amazing stuff, such as one-on-one sessions and help with your projects.