
Removing stop words in Java as part of data cleaning in Artificial Intelligence

More in The fastText Series.

Working with text datasets is very common in data science problems. A good example is sentiment analysis, where you get social network posts as data sets and, based on their content, need to estimate the sentiment around a topic of interest. When we work with text data, there are a number of steps we usually apply to "clean" it, such as normalising, removing stop words, stemming, and lemmatising. In this post, we'll see how we can remove stop words from our input text so that our analysis is based only on the actual content of the data.

But wait, what are stop words? Stop words are common words in the English language which can be removed from a sentence without any major impact on the meaning of the text or the performance of the model or algorithm. Examples include "is", "a", "the", "in", "on", and so on. There is no universally accepted list of stop words used by all Natural Language Processing (NLP) libraries, but there are lists maintained by different organisations that many libraries rely on. In this example, I'm using a list which is a union of several such lists (though I don't remember where I got the list from).
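To make the idea concrete before we get to the real project, here's a minimal, self-contained sketch. The five stop words and the sample sentence are made up for illustration; real stop word lists run to a few hundred entries:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StopWordDemo {
    public static void main(String[] args) {
        // A tiny, illustrative stop word list -- real lists are much longer.
        List<String> stopWords = Arrays.asList("is", "a", "the", "in", "on");

        // Split a sentence into words, then drop every stop word.
        List<String> words = new ArrayList<>(
                Arrays.asList("the cat is on the mat".split(" ")));
        words.removeAll(stopWords);

        System.out.println(words); // [cat, mat]
    }
}
```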

In this post, we’ll see how we can remove such stop words from a given text file, with a very simple code in Java. The sample text I’m using in this example is an essay from this source. You can use an input text of your own choice to experiment.


The Code

Before we start off with the code, let’s look at the list of stop words we have. You can find the list here. We’ll be using this file in our code. Because this is a Spring Boot project, we’ll keep this stop words list file along with the sample essay file in the resources directory. We’ll use Spring’s ResourceUtils class to read the files. I have a method for that, which is as follows:

private List<String> getFileContentAsList(String resourceFilePath) throws IOException {

    File file = ResourceUtils.getFile(resourceFilePath);
    List<String> lines = Files.readAllLines(file.toPath());
    lines = lines.stream().map(line -> line.toLowerCase()).collect(Collectors.toList());

    return lines;

}

As you can see, we keep each line of the file as a list element. For our stop word file this works just fine, as there's only one word per line. But if you look at the sample essay text file, which is the input for our project, each line holds a whole paragraph. That won't work for us, so we'll split each line into words and build a list with each word in the file as an element. For this, I have another method which takes the list of paragraphs as input and returns a list of words. This is that method:

private List<String> splitLinesToWords(List<String> essayLines) {

    List<String> essayWords = new ArrayList<>();

    for (String line : essayLines) {
        List<String> words = Arrays.asList(line.split(" "));
        essayWords.addAll(words);
    }

    return essayWords;
}
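One thing to note about this method: splitting on a single space leaves punctuation attached to words, so "success." would not match the stop word list entry "success". If that matters for your data, a variant that splits on runs of non-letter characters could look like the following. This is a sketch of an alternative, not the method used in this project, and the sample sentence is made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {
    // Variant of splitLinesToWords() that splits on any run of non-letter
    // characters (keeping apostrophes), so punctuation does not stay
    // attached to the words.
    static List<String> splitLinesToWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("[^a-zA-Z']+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("success, they say, is simple.");
        System.out.println(splitLinesToWords(lines)); // [success, they, say, is, simple]
    }
}
```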

That should take care of our data preparation phase. But if you look at the getFileContentAsList() method, we pass resourceFilePath as the input. Because we're using Spring Boot, this path is a bit different from the usual way we think of file paths:

private static final String resourceFilePath = "classpath:englishStopWords.txt";
private static final String sampleEssayFilePath = "classpath:sampleEssay.txt";

We prepend the string "classpath:" to the filename, which tells Spring that the file is on the class path. This will automatically look for the file in the resources directory. Using these file paths, we can read the contents of these files into variables this way:
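A caveat worth knowing: ResourceUtils.getFile() resolves the resource to a java.io.File, which works fine when running from an IDE or an exploded build, but fails once the application is packaged into a jar, because resources inside a jar are not files on disk. A jar-safe alternative is to read the resource as a stream, e.g. via getResourceAsStream(). Here's a small sketch of that approach; the ByteArrayInputStream stands in for the real resource stream so the example stays self-contained:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class ClasspathReadDemo {
    // Reads all lines from a stream, lower-casing each line just like
    // getFileContentAsList() does.
    static List<String> readLines(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
                .lines()
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for ClasspathReadDemo.class.getResourceAsStream("/englishStopWords.txt"),
        // which would be the jar-safe way to open the same file.
        InputStream in = new ByteArrayInputStream(
                "The\nIs\nA\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLines(in)); // [the, is, a]
    }
}
```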

List<String> englishStopWords = getFileContentAsList(resourceFilePath);
List<String> essayLines = getFileContentAsList(sampleEssayFilePath);

After reading the file contents into the essayLines variable, we'll convert this list of paragraphs to a list of words by calling the method we already looked at:

List<String> essayWords = splitLinesToWords(essayLines);

Once we have this much done, all that's left is to remove from the essayWords list every word that appears in the englishStopWords list, using the removeAll() method that Java's collections provide. We can do that with just one statement like this:

essayWords.removeAll(englishStopWords);
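A small performance note: removeAll() calls contains() on its argument once per element of the list, so with two plain lists this is roughly O(n × m). For our 1000-word essay that's irrelevant, but for larger corpora it can help to wrap the stop words in a HashSet, which makes each lookup O(1) while producing the same result. A minimal sketch (the sentence is made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class RemoveAllDemo {
    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("is", "a", "the", "in", "on");
        List<String> words = new ArrayList<>(
                Arrays.asList("the essay is a story in the making".split(" ")));

        // removeAll() probes its argument with contains() once per list
        // element, so a HashSet (constant-time lookups) beats a List for
        // large inputs -- the resulting list is identical either way.
        words.removeAll(new HashSet<>(stopWords));

        System.out.println(words); // [essay, story, making]
    }
}
```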

Testing if the code works

We can validate this by either printing the whole list, or by looking at the size of the list before and after removing all stop words:

long wordCountBeforeRemovingStopWords = essayWords.size();

essayWords.removeAll(englishStopWords);

long wordCountAfterRemovingStopWords = essayWords.size();

logger.info("wordCountBeforeRemovingStopWords: " + wordCountBeforeRemovingStopWords);
logger.info("wordCountAfterRemovingStopWords: " + wordCountAfterRemovingStopWords);

In our example, we’ll get the following output:

wordCountBeforeRemovingStopWords: 1058
wordCountAfterRemovingStopWords: 506

As you can see, we’ve removed over 500 stop words. We can use the remaining words to form sentences again and continue with the analysis. The whole class looks like this:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.util.ResourceUtils;

@SpringBootApplication
public class App implements CommandLineRunner {

    private static final Logger logger = LoggerFactory.getLogger(App.class);

    private static final String resourceFilePath = "classpath:englishStopWords.txt";
    private static final String sampleEssayFilePath = "classpath:sampleEssay.txt";

    public static void main(String[] args) {
        SpringApplication.run(App.class, args);
    }

    @Override
    public void run(String... args) throws Exception {

        try {

            List<String> englishStopWords = getFileContentAsList(resourceFilePath);

            logger.info("Stop words");
            logger.info(englishStopWords.toString());

            // I copied this essay from this place: https://www.apstudynotes.org/english/sample-essays/definition-success/
            List<String> essayLines = getFileContentAsList(sampleEssayFilePath);

            List<String> essayWords = splitLinesToWords(essayLines);

            long wordCountBeforeRemovingStopWords = essayWords.size();

            essayWords.removeAll(englishStopWords);

            long wordCountAfterRemovingStopWords = essayWords.size();

            logger.info("wordCountBeforeRemovingStopWords: " + wordCountBeforeRemovingStopWords);
            logger.info("wordCountAfterRemovingStopWords: " + wordCountAfterRemovingStopWords);

            logger.info("Essay after removing stop words: ");
            logger.info(essayWords.toString());

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private List<String> splitLinesToWords(List<String> essayLines) {

        List<String> essayWords = new ArrayList<>();

        for (String line : essayLines) {
            List<String> words = Arrays.asList(line.split(" "));
            essayWords.addAll(words);
        }

        return essayWords;
    }

    private List<String> getFileContentAsList(String resourceFilePath) throws IOException {

        File file = ResourceUtils.getFile(resourceFilePath);
        List<String> lines = Files.readAllLines(file.toPath());
        lines = lines.stream().map(line -> line.toLowerCase()).collect(Collectors.toList());

        return lines;

    }
}
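As mentioned earlier, the remaining words can be stitched back into a single string for further analysis, for example with String.join() (the word list here is made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class JoinDemo {
    public static void main(String[] args) {
        // A stand-in for the essayWords list after stop word removal.
        List<String> essayWords = Arrays.asList("success", "means", "different", "things");

        // Rejoin the surviving words with single spaces.
        String rejoined = String.join(" ", essayWords);
        System.out.println(rejoined); // success means different things
    }
}
```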

This is one of the simplest ways of removing stop words in Java. You can do something very similar in Python as well; let me know in the comments below if you're interested in the Python code. If you want to get right into the project and start messing with the text files, you can fork my GitHub repo and get started. And if you want to know more about how to get started with NLP, check out my fastText series, where I demonstrate how to use Facebook's fastText library for doing some basic NLP operations.

And if you like what you see here, or on my Medium blog, and would like to see more of such helpful technical posts in the future, consider supporting me on Patreon and GitHub.
