
Removing stop words in Java as part of data cleaning in Artificial Intelligence

More in The fastText Series.

Working with text datasets is very common in data science problems. A good example is sentiment analysis, where you get social network posts as data sets and, based on their content, need to estimate the sentiment around a topic of interest. When we work with text data, there are a number of steps we usually apply to "clean" it, such as normalising, removing stop words, stemming, and lemmatising. In this post, we'll see how we can remove stop words from our input text so that our analysis is based only on the actual content of the data.

But wait, what are stop words? Stop words are common words in the English language which can be removed from a sentence without any major impact on the meaning of the text or the performance of the model or algorithm. Examples include "is", "a", "the", "in", "on", and so on. There is no universally accepted list of stop words used by all Natural Language Processing (NLP) libraries, but there are lists maintained by different organisations that many libraries rely on. In this example, I'm using a list which is a union of several such lists (though I don't remember where I got the list from).
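To make the idea concrete before we get to the real project, here's a minimal, self-contained sketch. The five stop words and the sample sentence are made up for illustration; real stop word lists run to a few hundred entries:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StopWordDemo {
    public static void main(String[] args) {
        // A tiny, illustrative stop word list -- real lists are much longer.
        List<String> stopWords = Arrays.asList("is", "a", "the", "in", "on");

        // Split a sentence into words, then drop every stop word.
        List<String> words = new ArrayList<>(
                Arrays.asList("the cat is on the mat".split(" ")));
        words.removeAll(stopWords);

        System.out.println(words); // [cat, mat]
    }
}
```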

In this post, we’ll see how we can remove such stop words from a given text file, with a very simple code in Java. The sample text I’m using in this example is an essay from this source. You can use an input text of your own choice to experiment.


The Code

Before we start off with the code, let’s look at the list of stop words we have. You can find the list here. We’ll be using this file in our code. Because this is a Spring Boot project, we’ll keep this stop words list file along with the sample essay file in the resources directory. We’ll use Spring’s ResourceUtils class to read the files. I have a method for that, which is as follows:

private List<String> getFileContentAsList(String resourceFilePath) throws IOException {

    File file = ResourceUtils.getFile(resourceFilePath);
    List<String> lines = Files.readAllLines(file.toPath());
    lines = lines.stream().map(line -> line.toLowerCase()).collect(Collectors.toList());

    return lines;

}

As you can see, we keep each line of the file as a list element. For our stop word file this works just fine, as there's only one word per line. But if you look at the sample essay text file, which is the input for our project, each line holds a whole paragraph. That won't work for us, so we'll split each line into words and build a list with each word in the file as an element. For this, I have another method which takes the list of paragraphs as input and returns a list of words. This is that method:

private List<String> splitLinesToWords(List<String> essayLines) {

    List<String> essayWords = new ArrayList<>();

    for (String line : essayLines) {
        List<String> words = Arrays.asList(line.split(" "));
        essayWords.addAll(words);
    }

    return essayWords;
}
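One thing to note about this method: splitting on a single space leaves punctuation attached to words, so "success." would not match the stop word list entry "success". If that matters for your data, a variant that splits on runs of non-letter characters could look like the following. This is a sketch of an alternative, not the method used in this project, and the sample sentence is made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {
    // Variant of splitLinesToWords() that splits on any run of non-letter
    // characters (keeping apostrophes), so punctuation does not stay
    // attached to the words.
    static List<String> splitLinesToWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("[^a-zA-Z']+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("success, they say, is simple.");
        System.out.println(splitLinesToWords(lines)); // [success, they, say, is, simple]
    }
}
```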

That should take care of our data preparation phase. But if you look at the getFileContentAsList() method, we pass resourceFilePath as the input. Because we're using Spring Boot, this path is a bit different from the usual way we think of file paths:

private static final String resourceFilePath = "classpath:englishStopWords.txt";
private static final String sampleEssayFilePath = "classpath:sampleEssay.txt";

We prepend the string "classpath:" to the filename, which tells Spring that the file is on the class path. This will automatically look for the file in the resources directory. Using these file paths, we can read the contents of these files into variables this way:
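A caveat worth knowing: ResourceUtils.getFile() resolves the resource to a java.io.File, which works fine when running from an IDE or an exploded build, but fails once the application is packaged into a jar, because resources inside a jar are not files on disk. A jar-safe alternative is to read the resource as a stream, e.g. via getResourceAsStream(). Here's a small sketch of that approach; the ByteArrayInputStream stands in for the real resource stream so the example stays self-contained:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class ClasspathReadDemo {
    // Reads all lines from a stream, lower-casing each line just like
    // getFileContentAsList() does.
    static List<String> readLines(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
                .lines()
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for ClasspathReadDemo.class.getResourceAsStream("/englishStopWords.txt"),
        // which would be the jar-safe way to open the same file.
        InputStream in = new ByteArrayInputStream(
                "The\nIs\nA\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLines(in)); // [the, is, a]
    }
}
```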

List<String> englishStopWords = getFileContentAsList(resourceFilePath);
List<String> essayLines = getFileContentAsList(sampleEssayFilePath);

After reading the file contents into the essayLines variable, we'll convert this list of paragraphs to a list of words by calling the method we already looked at:

List<String> essayWords = splitLinesToWords(essayLines);

Once we have this much done, all that's left is to remove from the essayWords list every word that appears in the englishStopWords list, using the removeAll() method that Java's collections provide. We can do that with just one statement like this:

essayWords.removeAll(englishStopWords);
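A small performance note: removeAll() calls contains() on its argument once per element of the list, so with two plain lists this is roughly O(n × m). For our 1000-word essay that's irrelevant, but for larger corpora it can help to wrap the stop words in a HashSet, which makes each lookup O(1) while producing the same result. A minimal sketch (the sentence is made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class RemoveAllDemo {
    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("is", "a", "the", "in", "on");
        List<String> words = new ArrayList<>(
                Arrays.asList("the essay is a story in the making".split(" ")));

        // removeAll() probes its argument with contains() once per list
        // element, so a HashSet (constant-time lookups) beats a List for
        // large inputs -- the resulting list is identical either way.
        words.removeAll(new HashSet<>(stopWords));

        System.out.println(words); // [essay, story, making]
    }
}
```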

Testing if the code works

We can validate this by either printing the whole list, or by looking at the size of the list before and after removing all stop words:

long wordCountBeforeRemovingStopWords = essayWords.size();

essayWords.removeAll(englishStopWords);

long wordCountAfterRemovingStopWords = essayWords.size();

logger.info("wordCountBeforeRemovingStopWords: " + wordCountBeforeRemovingStopWords);
logger.info("wordCountAfterRemovingStopWords: " + wordCountAfterRemovingStopWords);

In our example, we’ll get the following output:

wordCountBeforeRemovingStopWords: 1058
wordCountAfterRemovingStopWords: 506

As you can see, we’ve removed over 500 stop words. We can use the remaining words to form sentences again and continue with the analysis. The whole class looks like this:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.util.ResourceUtils;

@SpringBootApplication
public class App implements CommandLineRunner {

    private static final Logger logger = LoggerFactory.getLogger(App.class);

    private static final String resourceFilePath = "classpath:englishStopWords.txt";
    private static final String sampleEssayFilePath = "classpath:sampleEssay.txt";

    public static void main(String[] args) {
        SpringApplication.run(App.class, args);
    }

    @Override
    public void run(String... args) throws Exception {

        try {

            List<String> englishStopWords = getFileContentAsList(resourceFilePath);

            logger.info("Stop words");
            logger.info(englishStopWords.toString());

            // I copied this essay from this place: https://www.apstudynotes.org/english/sample-essays/definition-success/
            List<String> essayLines = getFileContentAsList(sampleEssayFilePath);

            List<String> essayWords = splitLinesToWords(essayLines);

            long wordCountBeforeRemovingStopWords = essayWords.size();

            essayWords.removeAll(englishStopWords);

            long wordCountAfterRemovingStopWords = essayWords.size();

            logger.info("wordCountBeforeRemovingStopWords: " + wordCountBeforeRemovingStopWords);
            logger.info("wordCountAfterRemovingStopWords: " + wordCountAfterRemovingStopWords);

            logger.info("Essay after removing stop words: ");
            logger.info(essayWords.toString());

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private List<String> splitLinesToWords(List<String> essayLines) {

        List<String> essayWords = new ArrayList<>();

        for (String line : essayLines) {
            List<String> words = Arrays.asList(line.split(" "));
            essayWords.addAll(words);
        }

        return essayWords;
    }

    private List<String> getFileContentAsList(String resourceFilePath) throws IOException {

        File file = ResourceUtils.getFile(resourceFilePath);
        List<String> lines = Files.readAllLines(file.toPath());
        lines = lines.stream().map(line -> line.toLowerCase()).collect(Collectors.toList());

        return lines;

    }
}
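As mentioned earlier, the remaining words can be stitched back into a single string for further analysis, for example with String.join() (the word list here is made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class JoinDemo {
    public static void main(String[] args) {
        // A stand-in for the essayWords list after stop word removal.
        List<String> essayWords = Arrays.asList("success", "means", "different", "things");

        // Rejoin the surviving words with single spaces.
        String rejoined = String.join(" ", essayWords);
        System.out.println(rejoined); // success means different things
    }
}
```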

This is one of the simplest ways of removing stop words in Java. You can do something very similar in Python as well; let me know in the comments below if you're interested in the Python code. If you want to get right into the project and start messing with the text files, you can fork my GitHub repo and get started. And if you want to know more about how to get started with NLP, check out my fastText series, where I demonstrate how to use Facebook's fastText library for doing some basic NLP operations.

And if you like what you see here, or on my Medium blog, and would like to see more of such helpful technical posts in the future, consider supporting me on Patreon and GitHub.
