Lemmatization is one of the most common text pre-processing techniques used in Natural Language Processing (NLP) and machine learning in general. If you’ve already read my post about stemming of words in NLP, you’ll know that lemmatization is not that much different. In both stemming and lemmatization, we try to reduce a given word to its root word. The root word is called a stem in the stemming process, and it is called a lemma in the lemmatization process. But the two differ in more than just the name. Let’s see how.
How is Lemmatization different from Stemming?
In stemming, a part of the word is just chopped off at the tail end to arrive at the stem of the word. There are different algorithms for deciding how many characters to chop off, but none of them actually know the meaning of the word in the language it belongs to. In lemmatization, on the other hand, the algorithms have this knowledge. In fact, you can even say that these algorithms refer to a dictionary to understand the meaning of the word before reducing it to its root word, or lemma.
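To make the "chopping" idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for this post, not the Porter stemmer or any other real algorithm; the name toy_stem and the suffix list are assumptions chosen purely for illustration.

```python
# Toy suffix-stripping stemmer: illustrative only, not a real stemming algorithm.
# The suffix list is a hypothetical, hand-picked set for this example.
SUFFIXES = ("ing", "ed", "er", "es", "s")


def toy_stem(word: str) -> str:
    """Chop the first matching suffix off the tail end of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("walking", "walked", "cats", "better"):
    print(w, "->", toy_stem(w))
```

Notice that the function only looks at characters. It has no idea what any of these words mean, which is why better simply loses its er ending.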
So, a lemmatization algorithm would know that the word better is derived from the word good, and hence, the lemma is good. But a stemming algorithm wouldn’t be able to do the same. There could be over-stemming or under-stemming, and the word better could be reduced to either bet, or bett, or just retained as better. But there is no way in stemming that it could be reduced to its root word good. This, basically, is the difference between stemming and lemmatization.
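The dictionary idea can be sketched in a few lines. This is a toy under stated assumptions: TOY_LEMMAS is a tiny hand-written table standing in for a real dictionary, and toy_lemmatize is a hypothetical name. Real lemmatizers use much larger lexical resources and usually take the word's part of speech into account as well.

```python
# Hypothetical, hand-written lookup table standing in for a real dictionary.
TOY_LEMMAS = {
    "better": "good",
    "best": "good",
    "went": "go",
    "mice": "mouse",
    "was": "be",
}


def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if the word is known, else the word unchanged."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


for w in ("better", "went", "mice", "table"):
    print(w, "->", toy_lemmatize(w))
```

Because the mapping comes from knowledge about the language rather than from the spelling, better can map to good even though the two words share no characters. No amount of suffix chopping could produce that result.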
Advantages and Disadvantages of Lemmatization
As you could probably tell by now, the obvious advantage of lemmatization is that it is more accurate. So if you’re dealing with an NLP application such as a chat bot or a virtual assistant where understanding the meaning of the dialogue is crucial, lemmatization would be useful. But this accuracy comes at a cost.
Because lemmatization involves deriving the meaning of a word from something like a dictionary, it can be very time-consuming. So most lemmatization algorithms are slower than their stemming counterparts. There is also a computational overhead for lemmatization. In an ML problem, however, computational resources are rarely a cause for concern.
Should you choose Lemmatization over Stemming?
Well, I can’t answer that question. Lemmatization and stemming are both much more complex than what I’ve made them appear here. There are a lot more things to consider about both approaches before making a decision. But I have rarely seen any significant improvement in the efficiency or accuracy of a product that uses lemmatization over stemming. In most cases, at least according to my knowledge, the overhead that lemmatization demands is not justified. So it depends on the project in question. But I want to put out a disclaimer here: most of the work I have done in NLP is for text classification, and that is where I haven’t seen any significant difference. There are applications where the overhead of lemmatization is perfectly justified and, in fact, lemmatization would be a necessity.