Stopwords

A crucial aspect of NLP involves addressing "stop words." These are words that occur frequently in a text but rarely provide significant insight on their own. For example, words such as "the," "and," and "I," while commonplace, typically do not contribute meaningful information about the specific topic of a document, hence the name stopwords. Eliminating these words helps us better identify unique and relevant terms.
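
As a tiny illustration of the idea, the following minimal sketch filters a sentence against a hand-made, hypothetical three-word stopword list (real, much longer lists are explored below):

# a hand-made mini stopword list, purely for illustration
mini_stopwords = {"the", "and", "i"}

sentence = "I read the paper and the book"
filtered = [w for w in sentence.lower().split() if w not in mini_stopwords]
print(filtered)  # ['read', 'paper', 'book']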

It's important to emphasize that there is no universally agreed-upon list of stop words in the field of NLP; each framework offers its own. In the following, we will explore the Natural Language Toolkit (NLTK) stopword list and compare it to that of WordCloud, a popular Python library for plotting word clouds.

If you don't have the packages installed on your computer, simply run the following command in your terminal/command window:

pip install nltk wordcloud

NLTK’s Stopword List

In order to access NLTK's stopwords we first need to download the stopwords package:

import nltk
nltk.download('stopwords')

A graphical user interface (GUI) will appear if you run nltk.download() without an argument; in that case, select “All” and then click “Download”. Downloading everything may take some time, so you may want to grab a coffee (or tea) while waiting.

Once it is downloaded, you can access the package via:

from nltk.corpus import stopwords as sw

To see the list of languages for which NLTK provides stopwords:

print(len(sw.fileids()))
print(sw.fileids())
29
['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch',
 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian',
 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian',
 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']

At the time of this writing, NLTK supports stopwords in 29 languages. To access the stopwords for a given language, we use the words() method. Let's check the number of stopwords NLTK provides per language:

print("Number of stop words per language:") 
for i,l in enumerate(sw.fileids()): 
    print(f"{i+1}. {l.title()} - {len(sw.words(l))}")
Number of stop words per language:
1. Arabic - 754
2. Azerbaijani - 165
3. Basque - 326
4. Bengali - 398
5. Catalan - 278
6. Chinese - 841
7. Danish - 94
8. Dutch - 101
9. English - 179
10. Finnish - 235
11. French - 157
12. German - 232
13. Greek - 265
14. Hebrew - 221
15. Hinglish - 1036
16. Hungarian - 199
17. Indonesian - 758
18. Italian - 279
19. Kazakh - 324
20. Nepali - 255
21. Norwegian - 176
22. Portuguese - 207
23. Romanian - 356
24. Russian - 151
25. Slovene - 1784
26. Spanish - 313
27. Swedish - 114
28. Tajik - 163
29. Turkish - 53

English has 179 stopwords in NLTK at the time of this writing.

The top three languages with the highest number of stopwords are (see the sketch after this list for a way to compute the ranking):

  • Slovene - 1784

  • Hinglish (a mix of Hindi and English) - 1036

  • Indonesian - 758
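
The ranking above can be computed directly from the per-language counts; a minimal sketch using the sw object imported earlier:

# rank the languages by the size of their stopword lists
counts = {lang: len(sw.words(lang)) for lang in sw.fileids()}
top3 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
for lang, n in top3:
    print(f"{lang.title()} - {n}")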

Let's take a closer look at the stopwords in 'English'.

stops = sw.words('english')
print(f"Number of stopwords in NLTK :{len(sorted(stops))}")
print(sorted(stops))
Number of stopwords in NLTK :179
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and',
 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 
 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 
 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 
 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 
 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 
 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 
 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 
 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', 
 "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 
 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan',
 "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 
 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 
 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 
 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 
 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why',
 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll",
 "you're", "you've", 'your', 'yours', 'yourself', 'yourselves']

WordCloud's Stopword List

We can access WordCloud's English stopwords by simply importing its STOPWORDS set:

from wordcloud import STOPWORDS

print(f"Number of stopwords in WordCloud :{len(sorted(list(STOPWORDS)))}")
print(sorted(list(STOPWORDS)))
Number of stopwords in WordCloud :192
['a', 'about', 'above', 'after', 'again', 'against', 'all', 'also', 'am', 'an', 'and', 
'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below',
'between', 'both', 'but', 'by', 'can', "can't", 'cannot', 'com', 'could', "couldn't", 
'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 
'else', 'ever', 'few', 'for', 'from', 'further', 'get', 'had', "hadn't", 'has', "hasn't",
 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'hence', 'her', 'here', 
 "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'however', 'http',
 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's",
 'its', 'itself', 'just', 'k', "let's", 'like', 'me', 'more', 'most', "mustn't", 'my', 
 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 
 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'r', 'same', 
 'shall', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'since', 
 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 
 'themselves', 'then', 'there', "there's", 'therefore', 'these', 'they', "they'd", 
 "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 
 'until', 'up', 'very', 'was', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 
were', "weren't", 'what', "what's", 'when', "when's", 'where', "where's", 'which', 
'while', 'who', "who's", 'whom', 'why', "why's", 'with', "won't", 'would', "wouldn't", 
'www', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 
'yourselves']

The output shows that WordCloud provides 192 stopwords in English. Now it is time to compare both lists!

Stopwords: NLTK vs WordCloud

Though there are multiple ways to extract the words unique to each library, arguably the easiest is NumPy's setdiff1d function, which finds the set difference of two arrays. Let's begin by importing the NumPy library:

import numpy as np

The following code snippet will first create those 'difference' arrays and then print, for each library, the number of unique stopwords and the stopwords themselves:

# stopwords unique to the library
sw_nltk_only = np.setdiff1d(sorted(list(stops)),sorted(list(STOPWORDS)))
sw_wc_only = np.setdiff1d(sorted(list(STOPWORDS)),sorted(stops))


# print stopwords counts and the stopwords
print(f"\nStopwords only in NLTK: ({len(sw_nltk_only)})\n {sw_nltk_only}")
print(f"\nStopwords only in WordCloud: ({len(sw_wc_only)})\n {sw_wc_only}")

Output:

Stopwords only in NLTK: (35)
 ['ain' 'aren' 'couldn' 'd' 'didn' 'doesn' 'don' 'hadn' 'hasn' 'haven'
 'isn' 'll' 'm' 'ma' 'mightn' "mightn't" 'mustn' 'needn' "needn't" 'now'
 'o' 're' 's' 'shan' "should've" 'shouldn' 't' "that'll" 've' 'wasn'
 'weren' 'will' 'won' 'wouldn' 'y']

Stopwords only in WordCloud: (48)
 ['also' "can't" 'cannot' 'com' 'could' 'else' 'ever' 'get' "he'd" "he'll"
 "he's" 'hence' "here's" "how's" 'however' 'http' "i'd" "i'll" "i'm"
 "i've" 'k' "let's" 'like' 'otherwise' 'ought' 'r' 'shall' "she'd"
 "she'll" 'since' "that's" "there's" 'therefore' "they'd" "they'll"
 "they're" "they've" "we'd" "we'll" "we're" "we've" "what's" "when's"
 "where's" "who's" "why's" 'would' 'www']

The results show that 35 stopwords exist only in NLTK and 48 only in WordCloud. Let's create a combined, unique list of stopwords from both packages:

# create a combined list of stopwords
sw_combined = stops + list(sw_wc_only)
sw_combined = sorted(sw_combined)

print(f"\nNumber of stopwords in the combined list of NLTK & WC: {len(sw_combined)}\n {sw_combined}")
Number of stopwords in the combined list of NLTK & WC: 227
 ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'also', 'am', 'an',
  'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before',
  'being', 'below', 'between', 'both', 'but', 'by', 'can', "can't", 'cannot', 'com', 
  'could', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn',
  "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'else', 'ever', 'few',
  'for', 'from', 'further', 'get', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 
  'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'hence', 'her',
  'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 
  'however', 'http', 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', 'isn',
  "isn't", 'it', "it's", 'its', 'itself', 'just', 'k', "let's", 'like', 'll', 'm', 'ma',
  'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn',
  "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 
  'other', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'r',
  're', 's', 'same', 'shall', 'shan', "shan't", 'she', "she'd", "she'll", "she's", 'should',
  "should've", 'shouldn', "shouldn't", 'since', 'so', 'some', 'such', 't', 'than', 'that', 
  "that'll", "that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 
  "there's", 'therefore', 'these', 'they', "they'd", "they'll", "they're", "they've", 
  'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was',
  'wasn', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', 'weren', "weren't",
  'what', "what's", 'when', "when's", 'where', "where's", 'which', 'while', 'who', "who's",
  'whom', 'why', "why's", 'will', 'with', 'won', "won't", 'would', 'wouldn', "wouldn't", 
  'www', 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 
  'yourselves']
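
One practical note before moving on: membership checks against a Python list scan the whole list, so for larger corpora it can help to keep a set version of the combined stopwords for lookups. A minimal sketch; the set is purely an optional optimization and does not change the filtering results:

# a set makes `in` checks constant-time instead of scanning the list
sw_combined_set = set(sw_combined)

sample = ["The", "president", "said", "that", "the", "house", "was", "new"]
print([w for w in sample if w.lower() not in sw_combined_set])
# ['president', 'said', 'house', 'new']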

Now it is time to put our stopwords into action!

Filtering Stopwords in Text

For this part of the exercise we will be using NLTK's Brown Corpus, a well-known corpus that contains texts from various domains/categories, making it a great resource for language analysis.

Before we begin, make sure you have the NLTK Brown Corpus downloaded. If you haven't already, you can do so with the following:

import nltk
nltk.download('brown')

To access the Brown Corpus, you can use the nltk.corpus module:

from nltk.corpus import brown

# List the categories (genres) available in the Brown Corpus
categories = brown.categories()
print(f"Number of categories: {len(categories)}\n")
print(categories)
Number of categories: 15

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 
'science_fiction']

We will be working with the news category for this exercise:

# Calculate the number of words in a category
num_words = len(brown.words(categories='news'))
print(f"Number of words in 'news' category: {num_words}")

# Calculate the number of sentences in a category
num_sentences = len(brown.sents(categories='news'))
print(f"Number of sentences in 'news' category: {num_sentences}")

# Calculate the average words per sentence
avg_words_per_sentence = num_words / num_sentences
print(f"Average words per sentence in 'news' category: {avg_words_per_sentence:.2f}")
Number of words in 'news' category: 100554
Number of sentences in 'news' category: 4623
Average words per sentence in 'news' category: 21.75

We will now create a frequency distribution of words in our selected category to show why stopword removal is important:

from nltk import FreqDist

# Create a frequency distribution of words in the 'news' category
news_words = brown.words(categories='news')
fdist = FreqDist(news_words)

# Print the most common 15 words and their frequencies
common_words = fdist.most_common(15)
print("Most common words in 'news' category:")
i=0
for word, freq in common_words:
    i += 1
    print(f"{i}) {word}: {freq}")
Most common words in 'news' category:
1) the: 5580
2) ,: 5188
3) .: 4030
4) of: 2849
5) and: 2146
6) to: 2116
7) a: 1993
8) in: 1893
9) for: 943
10) The: 806
11) that: 802
12) ``: 732
13) is: 732
14) was: 717
15) '': 702

Notice that NLTK's tokenization keeps punctuation marks as separate tokens, so the output shows that the 15 most common words in our news category, which consists of 100,554 words, are either so-called stopwords or punctuation! Using our sw_combined list, we now filter the stopwords from our text:

# filter our stopwords
news_words_filtered = [w.lower() for w in news_words if w.lower() not in sw_combined]

# print stats
print(f"Number of words prior to stopwords filtering: {len(news_words):,}")
print(f"Number of words after stopwords filtering: {len(news_words_filtered):,}")
print(f"Difference: {len(news_words)-len(news_words_filtered):,} (words have been removed)")
Number of words prior to stopwords filtering: 100,554
Number of words after stopwords filtering: 62,849
Difference: 37,705 (words have been removed)

We lowercase each word with w.lower() for two reasons: 1) when calculating the frequency of a word, we want to ignore case so that the counts are correct, and 2) because our combined list only contains lowercase words, we need to convert each word before checking whether it exists in the stopwords list.
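
To see why case matters for the counts, compare the raw frequencies of "The" and "the" in the unfiltered distribution built earlier; lowercasing before counting merges the case variants into a single entry. A quick check using the fdist object from above:

# 'the' and 'The' are counted separately in the raw distribution
print(fdist['the'], fdist['The'])   # 5580 and 806 in the output above

# lowercasing before counting merges the case variants into one entry
fdist_lower = FreqDist(w.lower() for w in news_words)
print(fdist_lower['the'])           # combines 'the', 'The' (and any other casing)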

Time to create a frequency distribution without the stopwords and display the most common words after filtering:

# create a frequency distribution of words excluding stopwords
fdist_filtered = FreqDist(news_words_filtered)

# Print the most common words and their frequencies
common_words_filtered = fdist_filtered.most_common(15)
print("Most common words in 'news' category:")
i=0
for word, freq in common_words_filtered:
    i += 1
    print(f"{i}) {word}: {freq}")
Most common words in 'news' category:
1) ,: 5188
2) .: 4030
3) ``: 732
4) '': 702
5) said: 406
6) ;: 314
7) --: 300
8) mrs.: 253
9) new: 241
10) one: 213
11) last: 177
12) two: 174
13) ): 171
14) mr.: 170
15) (: 168

The output now displays a different set of words. Whether to remove words like "mr.", "mrs.", or "one" and "two" depends on the insights we are trying to extract from our analysis; we can either leave them as they are or add them to our combined list of stopwords and filter them out. Punctuation, however, unarguably causes noise in our dataset and should be removed before going further in our project. NLTK currently doesn't provide a list of punctuation marks, so we generally use the punctuation constant from Python's string module to obtain them. (You can also enter the punctuation marks manually.)

import string

# create a variable for punctuations
puncs = string.punctuation  #includes '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# extend the combined stopword list
sw_combined_extended = sw_combined + ['mr.','mrs.']

# filter out stopwords and punctuation
news_words_filtered_2 = [w.lower() for w in news_words_filtered if w.lower() not in sw_combined_extended and w not in puncs]

# create a frequency distribution of words excluding stopwords and punctuation
fdist_filtered_2 = FreqDist(news_words_filtered_2)

# Print the most common words and their frequencies
common_words_filtered_2 = fdist_filtered_2.most_common(15)
print("Most common words in 'news' category:")
i=0
for word, freq in common_words_filtered_2:
    i += 1
    print(f"{i}) {word}: {freq}")
Most common words in 'news' category:
1) ``: 732
2) '': 702
3) said: 406
4) --: 300
5) new: 241
6) one: 213
7) last: 177
8) two: 174
9) first: 158
10) state: 153
11) year: 142
12) president: 142
13) home: 132
14) made: 107
15) time: 103

Though string.punctuation helped us remove most of the punctuation, we can still see '``', "''", and '--' in our most common word list. In such situations it is best to do manual filtering:

# filter out stopwords and punctuation, plus the leftover quote and dash tokens
news_words_filtered_3 = [w.lower() for w in news_words_filtered_2 if (w.lower() not in sw_combined_extended) and (w not in puncs) and (w not in ['``', "''", '--'])]

# create a frequency distribution of words excluding stopwords and punctuation
fdist_filtered_3 = FreqDist(news_words_filtered_3)

# print the most common words and their frequencies
common_words_filtered_3 = fdist_filtered_3.most_common(15)
print("Most common words in 'news' category:")
i=0
for word, freq in common_words_filtered_3:
    i += 1
    print(f"{i}) {word}: {freq}")
Most common words in 'news' category:
1) said: 406
2) new: 241
3) one: 213
4) last: 177
5) two: 174
6) first: 158
7) state: 153
8) year: 142
9) president: 142
10) home: 132
11) made: 107
12) time: 103
13) years: 102
14) three: 101
15) house: 97

Now we have a list of the most common words, free of stopwords and punctuation, along with their frequencies.
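
As a takeaway, the steps above can be wrapped into a small reusable helper. Below is a minimal sketch; the function name filter_tokens and the extra_tokens parameter are illustrative choices, not part of NLTK or WordCloud:

import string

def filter_tokens(tokens, stopword_list, extra_tokens=()):
    """Lowercase tokens and drop stopwords, punctuation, and any extra unwanted tokens."""
    drop = set(stopword_list) | set(string.punctuation) | set(extra_tokens)
    return [t.lower() for t in tokens if t.lower() not in drop]

# usage with the objects built in this exercise
clean = filter_tokens(brown.words(categories='news'), sw_combined,
                      extra_tokens=['``', "''", '--', 'mr.', 'mrs.'])
print(FreqDist(clean).most_common(5))  # should closely match the final list above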

In this exercise, we explored common words, called stopwords, that occur often in a text but do not provide significant insights on their own. We also saw that different libraries such as NLTK and WordCloud have their own stopword lists and that there is no such thing as an ultimate stopword list. We then filtered those words out of one of the categories in NLTK's Brown Corpus. Exploring word frequencies in the dataset showed us that removing stopwords and punctuation is an iterative process. This brings us to the gist of this exercise: when working with text data, we need to be aware that any given stopword list is limited; we can always combine multiple lists or even create our own stopword list specific to the needs of our project, and filtering out stopwords and punctuation can be an iterative process (check out the Agoda Reviews project for reference). If you would like to explore what other frameworks offer in terms of stopwords, you can check the list of other NLP packages in the Natural Language Processing section.
