Stopwords

A crucial aspect of NLP involves addressing "stop words." These are words that occur frequently in a text but do not often provide significant insight on their own. For example, words such as "the," "and," and "I," while commonplace, typically do not contribute meaningful information about the specific topic of a document, hence the name stopwords. Eliminating these words helps us better identify unique and relevant terms.

It's important to emphasize that there is no universally agreed-upon list of stop words in the field of NLP; each framework offers its own. In the following, we will explore the Natural Language Toolkit (NLTK) stopword list and compare it to that of WordCloud, a popular Python library for plotting word clouds.

If you don't have the packages installed on your computer, simply run the following command in your terminal/command window:

pip install nltk wordcloud

NLTK’s Stopword List

In order to access NLTK's stopwords we first need to download the stopwords package:

import nltk
nltk.download('stopwords')

If you run nltk.download() without an argument instead, a graphical user interface (GUI) will appear; select “All” and then click “Download”. Downloading everything may take some time, so you may want to grab a coffee (or tea) while waiting.

Once it is downloaded, you can access the package via:

from nltk.corpus import stopwords as sw

To see the list of languages for which NLTK provides stopwords:

print(len(sw.fileids()))
print(sw.fileids())
29
['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch',
 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian',
 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian',
 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']

At the time of this writing, NLTK supports stopwords in 29 languages. To access the stopwords for a given language we use the words() method. Let's check the number of stopwords NLTK provides per language:
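One way to do this is a quick loop over fileids() and words() (a minimal sketch; the sorting is just for readability):

# Count the stopwords per language, from most to fewest
counts = {lang: len(sw.words(lang)) for lang in sw.fileids()}
for lang, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {n}")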

English has 179 stopwords in NLTK at the time of this writing.

The top 3 languages with the highest number of stopwords:

  • Slovene - 1784

  • Hinglish (a mix of Hindi and English) - 1036

  • Arabic - 754

Let's take a closer look at the stopwords in 'English'.
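Reusing the sw import from above:

english_sw = sw.words('english')
print(len(english_sw))
print(english_sw)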

WordCloud's Stopword List

We can access WordCloud's English stopwords by simply importing the STOPWORDS set:
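from wordcloud import STOPWORDS

# STOPWORDS is a plain Python set of English stopwords
print(len(STOPWORDS))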

The output shows that WordCloud provides us with 192 stopwords in English. Now it is time to compare both lists!

Stopwords: NLTK vs WordCloud

Though there are multiple ways to extract the words unique to each library, arguably the easiest is NumPy's setdiff1d function, which finds the set difference of two arrays. Let's begin by importing the numpy library:
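import numpy as np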

The following code snippet will first create those 'difference' arrays, then print the number of unique stopwords and the stopwords themselves for each library:
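Here's a sketch, assuming english_sw and STOPWORDS from above (the nltk_only / wc_only names are our own):

# Stopwords unique to each library
nltk_only = np.setdiff1d(english_sw, list(STOPWORDS))
wc_only = np.setdiff1d(list(STOPWORDS), english_sw)

print(len(nltk_only))
print(nltk_only)
print(len(wc_only))
print(wc_only)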

The results show that 35 stopwords exist only in NLTK, and 48 only in WordCloud. Let's create a combined, unique list of stopwords from both packages:
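# Union of both lists; we name it sw_combined, as referenced below
sw_combined = set(english_sw) | set(STOPWORDS)
print(len(sw_combined))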

Now it is time to put our stopwords into action!

Filtering Stopwords in Text

For this part of the exercise we will be using NLTK's Brown Corpus, which is a well-known text corpus that contains a collection of texts from various domains/categories, making it a great resource for language analysis.

Before we begin, make sure you have the NLTK Brown Corpus downloaded. If you haven't already, you can do so with the following:
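import nltk
nltk.download('brown')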

To access the Brown Corpus, you can use the nltk.corpus module:
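from nltk.corpus import brown

print(brown.categories())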

We will be working with the news category for this exercise:
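news_words = brown.words(categories='news')
print(len(news_words))  # 100,554 words in the news category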

We will now create a frequency distribution of the words in our selected category to show why stopword removal is important:
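A sketch using NLTK's FreqDist class:

from nltk import FreqDist

fdist = FreqDist(news_words)
print(fdist.most_common(15))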

Notice that NLTK treats punctuation marks as words as well, and the output shows that the 15 most common words in our news category, which consists of 100,554 words, are either so-called stopwords or punctuation! Using our sw_combined set, we now filter the stopwords from our text:
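# Keep only the words whose lowercase form is not in the combined stopword list
filtered_words = [w.lower() for w in news_words if w.lower() not in sw_combined]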

We lowercase each word, w.lower(), for two reasons: 1) when we calculate the frequency of a word we should ignore case so that the counts are correct, and 2) because our combined list consists only of lowercase words, we need to convert each word before checking whether it exists in the stopword list.

Time to create a frequency distribution without the stopwords and display the most common words after filtering:
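fdist_filtered = FreqDist(filtered_words)
print(fdist_filtered.most_common(15))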

The output now displays a different set of words. Whether to remove words like "mr.", "mrs.", "one", or "two" depends on the insights we are trying to extract from our analysis; we can either leave them as is or add them to our combined stopword list and filter them out. Unarguably, though, punctuation causes noise in our dataset and should be removed before going further in our project. NLTK currently doesn't provide a module for punctuation, so we generally use the punctuation constant from Python's string module to obtain all punctuation characters. (You can also enter the punctuation marks manually.)
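For example:

import string

# string.punctuation holds the single-character punctuation marks
punct = set(string.punctuation)
filtered_words = [w for w in filtered_words if w not in punct]
print(FreqDist(filtered_words).most_common(15))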

Though string.punctuation helped us remove most of the punctuation, we still see '``', "''", and '--' in our most common word list, since they are multi-character tokens. In such situations it is best to do manual filtering:
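# Remove the remaining multi-character punctuation tokens by hand
extra_punct = {'``', "''", '--'}
filtered_words = [w for w in filtered_words if w not in extra_punct]
print(FreqDist(filtered_words).most_common(15))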

Now we have a list of the most common words along with their frequencies, free of stopwords and punctuation.

In this exercise, we explored common words, called stopwords, that occur often in a text but do not provide significant insights on their own. We also saw that different libraries like NLTK and WordCloud have their own stopword lists and that there is no such thing as an ultimate stopword list. We then filtered those words out of one of the categories in NLTK's Brown Corpus. Exploring word frequencies in the dataset showed us that removing stopwords and punctuation is an iterative process. This brings us to the gist of this exercise: when we work with text data, we need to be aware that any given stopword list is limited, that we can always combine multiple lists or even create our own stopword list specific to the needs of our project, and that filtering out stopwords and punctuation can be an iterative process (check out the Agoda Reviews project for reference). If you would like to explore what other frameworks offer in terms of stopwords, you can check the list of other NLP packages in the Natural Language Processing section.
