Stopwords

A crucial aspect of NLP involves addressing "stop words." These are words that occur frequently in a text but do not often provide significant insights on their own. For example, words such as "the," "and," and "I," while commonplace, typically do not contribute meaningful information about the specific topic of a document, hence the name stopwords. Eliminating these words helps us better identify unique and relevant terms.

It's important to emphasize that there is no universally agreed-upon list of stop words in the field of NLP; each framework offers its own. In the following, we will explore NLTK's stop word list and compare it to that of WordCloud, a popular Python library for plotting word clouds.

If you don't have the packages installed on your computer, simply run the following command in your terminal/command window:

pip install nltk wordcloud

NLTK’s Stopword List

In order to access NLTK's stopwords we first need to download the stopwords package:

import nltk
nltk.download('stopwords')

A graphical user interface (GUI) may appear when you run this command. If it does, select "All" and then click "Download". The download may take some time, so you may want to grab a coffee (or tea) while waiting.

Once it is downloaded, you can access the stopwords via:

from nltk.corpus import stopwords as sw

To see the list of languages for which NLTK provides stopwords:

print(len(sw.fileids()))
print(sw.fileids())
29
['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch',
 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian',
 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian',
 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']

At the time of this writing, NLTK supports stopwords in 29 languages. To access the stopwords for a given language, we use the words() method. Let's check the number of stopwords NLTK provides per language:

print("Number of stop words per language:") 
for i,l in enumerate(sw.fileids()): 
    print(f"{i+1}. {l.title()} - {len(sw.words(l))}")
Number of stop words per language:
1. Arabic - 754
2. Azerbaijani - 165
3. Basque - 326
4. Bengali - 398
5. Catalan - 278
6. Chinese - 841
7. Danish - 94
8. Dutch - 101
9. English - 179
10. Finnish - 235
11. French - 157
12. German - 232
13. Greek - 265
14. Hebrew - 221
15. Hinglish - 1036
16. Hungarian - 199
17. Indonesian - 758
18. Italian - 279
19. Kazakh - 324
20. Nepali - 255
21. Norwegian - 176
22. Portuguese - 207
23. Romanian - 356
24. Russian - 151
25. Slovene - 1784
26. Spanish - 313
27. Swedish - 114
28. Tajik - 163
29. Turkish - 53

English has 179 stopwords in NLTK at the time of this writing.

The top 3 languages with the highest number of stopwords:

  • Slovene - 1784

  • Hinglish (a mix of Hindi and English) - 1036

  • Arabic - 754
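
For reference, this ranking can be computed directly from the per-language counts; a minimal sketch, using the sw reader imported above:

# sort languages by stopword count (descending) and take the top 3
counts = {lang: len(sw.words(lang)) for lang in sw.fileids()}
top3 = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:3]
print(top3)
# expected (per the counts above): [('slovene', 1784), ('hinglish', 1036), ('arabic', 754)]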

Let's take a closer look at the stopwords in 'English'.

stops = sw.words('english')
print(f"Number of stopwords in NLTK :{len(sorted(stops))}")
print(sorted(stops))
Number of stopwords in NLTK :179
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and',
 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 
 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 
 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 
 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 
 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 
 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 
 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 
 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', 
 "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 
 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan',
 "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 
 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 
 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 
 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 
 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why',
 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll",
 "you're", "you've", 'your', 'yours', 'yourself', 'yourselves']

WordCloud's Stopword List

We can access WordCloud's English stopword list by simply importing STOPWORDS:

from wordcloud import STOPWORDS

print(f"Number of stopwords in WordCloud :{len(sorted(list(STOPWORDS)))}")
print(sorted(list(STOPWORDS)))
Number of stopwords in WordCloud :192
['a', 'about', 'above', 'after', 'again', 'against', 'all', 'also', 'am', 'an', 'and', 
'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below',
'between', 'both', 'but', 'by', 'can', "can't", 'cannot', 'com', 'could', "couldn't", 
'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 
'else', 'ever', 'few', 'for', 'from', 'further', 'get', 'had', "hadn't", 'has', "hasn't",
 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'hence', 'her', 'here', 
 "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'however', 'http',
 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's",
 'its', 'itself', 'just', 'k', "let's", 'like', 'me', 'more', 'most', "mustn't", 'my', 
 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 
 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'r', 'same', 
 'shall', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'since', 
 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 
 'themselves', 'then', 'there', "there's", 'therefore', 'these', 'they', "they'd", 
 "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 
 'until', 'up', 'very', 'was', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 
were', "weren't", 'what', "what's", 'when', "when's", 'where', "where's", 'which', 
'while', 'who', "who's", 'whom', 'why', "why's", 'with', "won't", 'would', "wouldn't", 
'www', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 
'yourselves']

The output shows us that WordCloud provides us with 192 stopwords in English. Now, it is time to compare both lists!

Stopwords: NLTK vs WordCloud

Though there are multiple ways to extract the words unique to each library, arguably the easiest is through NumPy's setdiff1d function, which finds the set difference of two arrays. Let's begin by importing the numpy library:

import numpy as np

The following code snippet will first create those 'difference' arrays and then print, for each library, the number of unique stopwords and the stopwords themselves:

# stopwords unique to the library
sw_nltk_only = np.setdiff1d(sorted(list(stops)),sorted(list(STOPWORDS)))
sw_wc_only = np.setdiff1d(sorted(list(STOPWORDS)),sorted(stops))


# print stopwords counts and the stopwords
print(f"\nStopwords only in NLTK: ({len(sw_nltk_only)})\n {sw_nltk_only}")
print(f"\nStopwords only in WordCloud: ({len(sw_wc_only)})\n {sw_wc_only}")

Output:

Stopwords only in NLTK: (35)
 ['ain' 'aren' 'couldn' 'd' 'didn' 'doesn' 'don' 'hadn' 'hasn' 'haven'
 'isn' 'll' 'm' 'ma' 'mightn' "mightn't" 'mustn' 'needn' "needn't" 'now'
 'o' 're' 's' 'shan' "should've" 'shouldn' 't' "that'll" 've' 'wasn'
 'weren' 'will' 'won' 'wouldn' 'y']

Stopwords only in WordCloud: (48)
 ['also' "can't" 'cannot' 'com' 'could' 'else' 'ever' 'get' "he'd" "he'll"
 "he's" 'hence' "here's" "how's" 'however' 'http' "i'd" "i'll" "i'm"
 "i've" 'k' "let's" 'like' 'otherwise' 'ought' 'r' 'shall' "she'd"
 "she'll" 'since' "that's" "there's" 'therefore' "they'd" "they'll"
 "they're" "they've" "we'd" "we'll" "we're" "we've" "what's" "when's"
 "where's" "who's" "why's" 'would' 'www']

The results show that 35 stopwords exist only in NLTK and 48 only in WordCloud. Let's create a combined list of unique stopwords from both packages:

# create a combined list of stopwords
sw_cobined = stops + list(sw_wc_only)
sw_cobined = sorted(sw_cobined)

print(f"\nNumber of stopwords NLTK + WC: {len(sw_cobined)}\n {sw_cobined}")python
Number of stopwords in the combined list of NLTK & WC : 227
 ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'also', 'am', 'an',
  'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before',
  'being', 'below', 'between', 'both', 'but', 'by', 'can', "can't", 'cannot', 'com', 
  'could', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn',
  "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'else', 'ever', 'few',
  'for', 'from', 'further', 'get', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 
  'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'hence', 'her',
  'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 
  'however', 'http', 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', 'isn',
  "isn't", 'it', "it's", 'its', 'itself', 'just', 'k', "let's", 'like', 'll', 'm', 'ma',
  'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn',
  "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 
  'other', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'r',
  're', 's', 'same', 'shall', 'shan', "shan't", 'she', "she'd", "she'll", "she's", 'should',
  "should've", 'shouldn', "shouldn't", 'since', 'so', 'some', 'such', 't', 'than', 'that', 
  "that'll", "that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 
  "there's", 'therefore', 'these', 'they', "they'd", "they'll", "they're", "they've", 
  'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was',
  'wasn', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', 'weren', "weren't",
  'what', "what's", 'when', "when's", 'where', "where's", 'which', 'while', 'who', "who's",
  'whom', 'why', "why's", 'will', 'with', 'won', "won't", 'would', 'wouldn', "wouldn't", 
  'www', 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 
  'yourselves']

Now it is time to put our stopwords into action!

Filtering Stopwords in Text

For this part of the exercise we will be using NLTK's Brown Corpus, which is a well-known text corpus that contains a collection of texts from various domains/categories, making it a great resource for language analysis.

Before we begin, make sure you have the NLTK Brown Corpus downloaded. If you haven't already, you can do so with the following:

import nltk
nltk.download('brown')

To access the Brown Corpus, you can use the nltk.corpus module:

from nltk.corpus import brown

# List the categories (genres) available in the Brown Corpus
categories = brown.categories()
print(f"Number of categories: {len(categories)}\n")
print(categories)
Number of categories: 15

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 
'science_fiction']

We will be working with the 'news' category for this exercise:

# Calculate the number of words in a category
num_words = len(brown.words(categories='news'))
print(f"Number of words in 'news' category: {num_words}")

# Calculate the number of sentences in a category
num_sentences = len(brown.sents(categories='news'))
print(f"Number of sentences in 'news' category: {num_sentences}")

# Calculate the average words per sentence
avg_words_per_sentence = num_words / num_sentences
print(f"Average words per sentence in 'news' category: {avg_words_per_sentence:.2f}")
Number of words in 'news' category: 100554
Number of sentences in 'news' category: 4623
Average words per sentence in 'news' category: 21.75

We will now create a frequency distribution of words in our selected category to show why stopword removal is important:

from nltk import FreqDist

# Create a frequency distribution of words in the 'news' category
news_words = brown.words(categories='news')
fdist = FreqDist(news_words)

# Print the most common 15 words and their frequencies
common_words = fdist.most_common(15)
print("Most common words in 'news' category:")
i=0
for word, freq in common_words:
    i += 1
    print(f"{i}) {word}: {freq}")
Most common words in 'news' category:
1) the: 5580
2) ,: 5188
3) .: 4030
4) of: 2849
5) and: 2146
6) to: 2116
7) a: 1993
8) in: 1893
9) for: 943
10) The: 806
11) that: 802
12) ``: 732
13) is: 732
14) was: 717
15) '': 702

Notice that NLTK treats punctuation marks as tokens as well, and the output shows that the 15 most common tokens in our 'news' category, which consists of 100,554 words, are either so-called stopwords or punctuation! Using our sw_cobined list, we now filter the stopwords from our text:

# filter our stopwords
news_words_filtered = [w.lower() for w in news_words if w.lower() not in sw_cobined]

# print stats
print(f"Number of words prior to stopwords filtering: {len(news_words):,}")
print(f"Number of words after stopwords filtering: {len(news_words_filtered):,}")
print(f"Difference: {len(news_words)-len(news_words_filtered):,} (words have been removed)")
Number of words prior to stopwords filtering: 100,554
Number of words after stopwords filtering: 62,849
Difference: 37,705 (words have been removed)

We lowercase each word with w.lower() for two reasons: 1) when calculating the frequency of a word, we should ignore case to get correct counts, and 2) because our combined list consists only of lowercase words, we need to convert each word before checking whether it exists in the stopwords list.
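
To see why case matters for the counts, we can compare the raw frequencies of 'the' and 'The' in the unfiltered distribution created earlier (a quick check against the fdist object from above):

# 'the' and 'The' are counted separately unless we normalize case
print(fdist['the'])                  # 5580
print(fdist['The'])                  # 806
print(fdist['the'] + fdist['The'])   # 6386 once case is ignored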

Time to create a frequency distribution without the stopwords and display the most common words after filtering:

# create a frequency distribution of words excluding stopwords
fdist_filtered = FreqDist(news_words_filtered)

# Print the most common words and their frequencies
common_words_filtered = fdist_filtered.most_common(15)
print("Most common words in 'news' category:")
i=0
for word, freq in common_words_filtered:
    i += 1
    print(f"{i}) {word}: {freq}")
Most common words in 'news' category:
1) ,: 5188
2) .: 4030
3) ``: 732
4) '': 702
5) said: 406
6) ;: 314
7) --: 300
8) mrs.: 253
9) new: 241
10) one: 213
11) last: 177
12) two: 174
13) ): 171
14) mr.: 170
15) (: 168

The output now displays a different set of words. Whether to remove words like "mr.", "mrs.", or "one" and "two" depends on the insights we are trying to extract from our analysis; we can either leave them as is or add them to our combined stopword list and filter them out. Punctuation, however, unarguably adds noise to our dataset and should be removed before going further in our project. NLTK currently doesn't provide a list of punctuation marks, so we generally use the punctuation constant from Python's string module to obtain them. (You can also enter the punctuation marks manually.)

import string

# create a variable for punctuations
puncs = string.punctuation  #includes '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# extend stopword cobined list
sw_cobined_extented = sw_cobined + ['mr.','mrs.']

# filter out stop words and punctions
news_words_filtered_2 = [w.lower() for w in news_words_filtered if w.lower() not in sw_cobined_extented and w not in puncs]

# create a frequency distribution of words excluding stopwords and punctions
fdist_filtered_2 = FreqDist(news_words_filtered_2)

# Print the most common words and their frequencies
common_words_filtered_2 = fdist_filtered_2.most_common(15)
print("Most common words in 'news' category:")
i=0
for word, freq in common_words_filtered_2:
    i += 1
    print(f"{i}) {word}: {freq}")
Most common words in 'news' category:
1) ``: 732
2) '': 702
3) said: 406
4) --: 300
5) new: 241
6) one: 213
7) last: 177
8) two: 174
9) first: 158
10) state: 153
11) year: 142
12) president: 142
13) home: 132
14) made: 107
15) time: 103

Though the punctuation constant helped us remove most of the punctuation, we still see '``', "''", and '--' remaining in our most common word list. In such situations it is best to do manual filtering:

# filter out stop words and punctions
news_words_filtered_3 = [w.lower() for w in news_words_filtered_2 if (w.lower() not in sw_cobined_extented) and (w not in puncs) and (w not in ['``', "''", '--'])]

# create a frequency distribution of words excluding stopwords and punctions
fdist_filtered_3 = FreqDist(news_words_filtered_3)

# print the most common words and their frequencies
common_words_filtered_3 = fdist_filtered_3.most_common(15)
print("Most common words in 'news' category:")
i=0
for word, freq in common_words_filtered_3:
    i += 1
    print(f"{i}) {word}: {freq}")
Most common words in 'news' category:
1) said: 406
2) new: 241
3) one: 213
4) last: 177
5) two: 174
6) first: 158
7) state: 153
8) year: 142
9) president: 142
10) home: 132
11) made: 107
12) time: 103
13) years: 102
14) three: 101
15) house: 97

Now we have a frequency list in which the most common words are meaningful terms rather than stopwords or punctuation.
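
With the stopwords and punctuation removed, the filtered frequency distribution can be fed directly into downstream tools. As a closing illustration, here is a minimal sketch of plotting a word cloud from it using the WordCloud library introduced earlier (assuming matplotlib is installed; we pass no stopword list since the words have already been filtered):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# build the cloud from the already-filtered frequency distribution
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(dict(fdist_filtered_3))

# display the result
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()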

In this exercise, we explored common words, called stopwords, that occur often in a text but do not provide significant insights on their own. We also saw that different libraries like NLTK and WordCloud have their own stopword lists and that there is no such thing as an ultimate stopword list. We then filtered those words out of one of the categories in NLTK's Brown Corpus. Exploring word frequencies in the dataset showed us that removing stopwords and punctuation is an iterative process. This brings us to the gist of this exercise: when working with text data, we need to be aware that any given stopword list is limited; we can always combine multiple lists or even create our own list specific to the needs of our project, and filtering out stopwords and punctuation may take several passes (see also the Agoda Reviews project for reference). If you would like to explore what other frameworks offer in terms of stopwords, you can check the list of other NLP packages in the Natural Language Processing section.
