Agoda Reviews - Part II - Sentiment Analysis and WordClouds

Last updated 4 months ago

In Part I of this project, we extracted Agoda's reviews, detected the language of each review, and pre-processed the entire dataset so that we can apply Exploratory Data Analysis, Sentiment Analysis, and Visualization. Although you do not need to follow the steps in Part I to perform the analyses in this section, it is worth checking them for the completeness of the project.

In this phase, we will start our analyses by importing the data that we saved in Part I. First things first, though: here are the packages we will need for this section:

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from wordcloud import WordCloud, STOPWORDS

1) Exploratory Data Analysis

Let us import the data and check the first 5 rows:

data = pd.read_csv('./agoda_reviews_231019_with_lang_unknown_dropped.csv')
data.head()

Time to check the overall information about the dataset by using pandas .info() method.

data.info()

.info() shows us that there are 12 columns with object and int64 datatypes and 74306 rows (i.e. records), and that the dataset uses 6.8MB of memory. The columns content, replyContent, repliedAt, appVersion, and content with emojis have NULL values.
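To see exactly how many values are missing in each column, one option is to combine isna() with sum(). The snippet below is a minimal sketch on a made-up three-row dataframe (toy and its values are hypothetical stand-ins, not the Agoda data):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the Agoda reviews
toy = pd.DataFrame({
    "content": ["great app", None, "refund never arrived"],
    "replyContent": [None, None, "Sorry to hear that"],
    "score": [5, 4, 1],
})

# isna() flags missing cells; sum() counts them per column
null_counts = toy.isna().sum()

# show only the columns that actually contain missing values
print(null_counts[null_counts > 0])
```

The same two calls should work unchanged on data.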

We'll now explore our data to gain insights and identify trends. Since we have already covered the number of reviews per language in Part I, we begin with displaying the score/rating distribution and then check the positive vs negative review distributions:

a) Number of Reviews per Rating

Let us create a summary table called scores and gather the overall information in this dataframe:

scores = (
    data['score']
    .value_counts()
    .reset_index()
    .sort_values(by='score',ascending=False)
)
scores.reset_index(drop=True, inplace=True)
scores

The code snippet above gives us the number of reviews per rating. According to our calculation, there are 46820 5-Star reviews, 8107 4-Star reviews, and so on. Let's plot the results:

# create a bar plot
ax = scores.plot.bar(x='score',
                y='count', 
                rot=0, 
                color='cyan')

# set axes and title parameters
ax.set_title('Rating Distribution')
ax.set_ylabel('Number of Reviews')
ax.set_ylim(0,80000)

# annotate the bars
for x,y in scores["count"].items():
    plt.annotate(text = y, 
                 xy = (x,y), 
                 textcoords="offset points", 
                 xytext=(1,2), 
                 ha="center")

Although we have the numbers, a better representation is the relative frequency distribution, i.e. the percentage of each rating group:

scores['pct'] = round(scores['count']/scores['count'].sum()*100,2)
scores

We now know that 5-Star reviews account for ~63% of all reviews.
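As an alternative to dividing by the total ourselves, value_counts can return proportions directly through its normalize parameter. A minimal sketch, assuming a made-up series of ratings (toy_scores is hypothetical):

```python
import pandas as pd

# Hypothetical ratings, not the real Agoda data
toy_scores = pd.Series([5, 5, 5, 4, 1], name="score")

# normalize=True returns proportions instead of raw counts
pct = (toy_scores.value_counts(normalize=True) * 100).round(2)
print(pct)
```

Both approaches should give the same percentages; this one simply skips the intermediate count column.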

b) Sentiment Analysis: Negative and Positive Reviews

"Negative" and "Positive" tags will be added to the reviews using the following logic:

  • Reviews with a rating of 4 or 5 -> will be considered as "Positive"

  • Reviews with a rating of 1, 2, or 3 -> will be considered as "Negative"

Let us create a binary column called review_type for "positive" or "negative" tags based on our methodology above:

data['review_type'] = np.where(data['score']<=3, 'negative', 'positive')
data.head()

Now add this info to our summary table, scores:

scores['review_type'] = np.where(scores['score']<=3, 'negative', 'positive')
scores

Using the pandas groupby function, we can then get the overall distribution of positive vs negative reviews:

pos_neg_dist = scores.groupby('review_type').agg({'count':'sum','pct':'sum'})
pos_neg_dist

According to our analysis, Agoda's app has 54927 positive and 19379 negative reviews, which correspond to 73.92% and 26.08%, respectively. The percentage of negative reviews seems high and requires further analysis, which we will do in the next section; but before that, let us visualize the review distribution of the 10 languages with the most reviews. (Also keep in mind that we have counted "rating 3" reviews as negative. They make up ~2.7% of all reviews, and one can argue that this group belongs with the positive reviews. Even then, the negative review percentage is 23.38%.)

Positive vs Negative Reviews with Rating 3 as Positive
scores['review_type_2'] = np.where(scores['score']<=2, 'negative', 'positive')
scores.groupby('review_type_2').agg({'count':'sum','pct':'sum'})
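To compare the two labelling conventions side by side, it may help to make the cut-off a parameter. The helper below is only a sketch: label_reviews and negative_max are hypothetical names, and toy is made-up data.

```python
import numpy as np
import pandas as pd

def label_reviews(score, negative_max=3):
    """Tag scores <= negative_max as 'negative' and the rest as 'positive'."""
    return np.where(score <= negative_max, "negative", "positive")

# Hypothetical scores, one per rating level
toy = pd.DataFrame({"score": [1, 2, 3, 4, 5]})

# rating 3 counted as negative (the convention used in this section)
toy["rating3_negative"] = label_reviews(toy["score"], negative_max=3)

# rating 3 counted as positive (the alternative convention)
toy["rating3_positive"] = label_reviews(toy["score"], negative_max=2)
print(toy)
```

Only the row with score 3 changes its label between the two columns, which is why the two conventions differ by the share of 3-Star reviews.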

c) Sentiment analysis over Top 10 Languages with the Most Number of Reviews

Using the pandas value_counts method, we will extract the 10 languages with the most reviews:

top10 = data['language'].value_counts().reset_index()[:10]
top10

Let's now plot this data using the pandas plot function. Be aware that the number of English reviews is two orders of magnitude larger than that of the other languages; therefore, we will use a logarithmic scale for plotting:

top10.plot.barh(y='count',
                x='language', 
                grid='yaxis',
                title='Top 10 Languages with Highest Number of Reviews',
                xlabel='Number of Reviews',
                ylabel='Languages',
                logx=True)
                
plt.gca().invert_yaxis()

Let's add positive and negative review type information for the Top 10 languages.

# create a dataframe for Top 10 languages
top10_review_type_count=(data[data['language'].isin(list(top10['language']))]
 .groupby(['language','review_type'])
 .agg({'reviewId':'count'})
 .reset_index()
)

# update column name
top10_review_type_count.rename(columns={'reviewId':'count'}, inplace=True)

# calculate total number of reviews for each language
total_counts = top10_review_type_count.groupby('language')['count'].sum().reset_index()

# join tables top10_review_type_count and total_counts
merged = top10_review_type_count.merge(total_counts, 
                                       on='language', 
                                       suffixes=('', '_total'))

# sort the dataframe based on total number of reviews in descending order
sorted_df = merged.sort_values(by=['count_total','language'], ascending=[False, False]).reset_index(drop=True)


# add percentage column and display the dataframe
sorted_df['percentage']=round(sorted_df['count']/sorted_df['count_total']*100,2)
sorted_df

Note that, in order to keep the integrity of the data (i.e. to keep the rows of each language together), we also used the language column when sorting the merged dataframe.

The table above provides us with

  • total number of reviews per language

  • number of negative and positive reviews per language

  • percentage of negative and positive reviews per language

Adding percentage information to our dataset shows that, although their total number of reviews is not high, Arabic-speaking customers have the highest percentage of negative reviews. The two other languages with a high negative review percentage are English and Thai. Given the large number of English reviews, it is no surprise that English-speaking customers have the highest review counts among all languages!
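The per-language percentages can also be obtained in a single step with pd.crosstab and its normalize parameter. This is a sketch of an alternative route on a made-up dataframe (toy is hypothetical), not the approach used for the table above:

```python
import pandas as pd

# Hypothetical reviews: three English, two Thai
toy = pd.DataFrame({
    "language": ["English", "English", "English", "Thai", "Thai"],
    "review_type": ["positive", "negative", "positive", "negative", "positive"],
})

# normalize="index" divides each row by its total,
# so every language's percentages add up to 100
share = (
    pd.crosstab(toy["language"], toy["review_type"], normalize="index") * 100
).round(2)
print(share)
```

The result is a wide table (one row per language, one column per review type), which can be easier to scan than the long format.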

It is good practice to plot the data when a table has more than 10 rows; keep in mind, however, that this number is completely arbitrary!

# add a column to df with language and review_type to be used in the plot
sorted_df['lang_rt']=sorted_df['language']+' - ' +sorted_df['review_type']

# add a column to df with colors based on review_type
colors = {'positive': '#0dd388', 'negative': '#FE6F5E'}
sorted_df['color'] = sorted_df['review_type'].map(colors)

# create a bar plot with alternating colors
ax = sorted_df.plot.bar(x='lang_rt', 
                        y='count', 
                        color=sorted_df['color'], 
                        logy=True, 
                        figsize=(10, 6))

# set labels and title
ax.set_ylabel('Number of Reviews (log scale)')
ax.set_xlabel('Language and Review Type')
ax.set_title('Review Counts \n(Positives in Green, Negatives in Red)')
ax.get_legend().remove()

# add data labels to the graph
for x,y in sorted_df["count"].items():
    plt.annotate(y, (x,y), textcoords="offset points", xytext=(2.5, 5), ha="center")

# display plot
plt.show()

Now, it is time to visually inspect what customers are talking about in their reviews.

2) WordClouds

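We will build the word clouds with the wordcloud library. Make sure to install the package first by running:

pip install wordcloud

Before visualizing the most common words, we will

  1. first combine all reviews into a single text string

  2. then remove the stopwords using STOPWORDS provided by the wordcloud library

Keep in mind that we can also use stopword sets provided by other NLP packages such as NLTK, or combine multiple lists to create our own stopword list.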
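Combining stopword lists comes down to a set union. The sketch below uses small made-up sets so that it runs without extra packages; in practice base_stopwords could be wordcloud's STOPWORDS and extra_stopwords could come from NLTK's stopwords.words("english"). All names here are hypothetical.

```python
# Stand-in sets; replace them with real stopword lists as needed
base_stopwords = {"the", "a", "is", "and"}
extra_stopwords = {"is", "very", "so"}
domain_stopwords = {"app", "agoda"}  # words specific to our reviews

# union of all three sets; duplicates such as "is" are kept only once
custom_stopwords = base_stopwords | extra_stopwords | domain_stopwords

# quick check on a sample sentence
tokens = "the app is very easy and fast".split()
kept = [t for t in tokens if t.lower() not in custom_stopwords]
print(kept)
```

The resulting set can then be passed to WordCloud through its stopwords argument.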

# filter only English reviews
language_chosen = 'English'
reviews = data['language']== language_chosen

# Combine all reviews into a single string
all_reviews = '\n'.join(data[reviews]['content_no_emojis'].astype(str))
# note: len() of a string counts characters, not words
print (f"There are a total of {len(all_reviews):,} characters in the combination of all {language_chosen} reviews.")

# define stop words
stopwords = set(STOPWORDS)

# generate a word cloud excluding stop words
wordcloud = WordCloud(width=800, height=400, 
                      random_state=21, 
                      max_font_size=110,
                      background_color='white', 
                      stopwords=stopwords).generate(all_reviews)

# plot the WordCloud image
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')

# store to file
# plt.savefig("./agoda_reviews_all.jpeg", format="jpeg",dpi=400)

plt.show()
There are a total of 5,249,206 characters in the combination of all English reviews.
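Note that len() applied to a string returns the number of characters, not words. If a word count is needed, a rough version can be sketched by splitting on whitespace (text below is a made-up example):

```python
# Hypothetical text: two short reviews joined by a newline
text = "easy to use\nfast booking"

# len() counts every character, including spaces and the newline
n_chars = len(text)

# split() with no argument splits on any run of whitespace
n_words = len(text.split())

print(f"{n_chars:,} characters, {n_words:,} words")
```

Whitespace splitting treats punctuation as part of a word, so this is an approximation rather than a proper tokenizer.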

Here is our wordcloud:

We can also check the most frequent words with the words_ attribute of our cloud, which is a dictionary. Here is the list of the top 10 most frequent words:

# display most frequent words
c = 1
for k,v in wordcloud.words_.items():
    print (c,"-", k,":",v)
    c+=1
    if c==11:
        break
1 - app : 1.0
2 - hotel : 0.8624198186241981
3 - booking : 0.8534616235346162
4 - Agoda : 0.8202831232028313
5 - easy : 0.6883432868834328
6 - use : 0.6575978765759788
7 - book : 0.3795620437956204
8 - room : 0.3522450785224508
9 - booked : 0.31176730811767306
10 - good : 0.2814642778146428

According to our list the word "app" is the most frequent word, which is not surprising given the fact that the reviews are extracted from the Google Play Store, i.e. for Agoda's app.
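The values printed above are relative rather than absolute: the most frequent word gets 1.0 and the others appear to be scaled against it. The sketch below illustrates that kind of normalisation with collections.Counter on a made-up token list. It only demonstrates the idea and makes no claim about how the wordcloud library tokenizes or counts words internally.

```python
from collections import Counter

# Hypothetical tokens standing in for the combined reviews
tokens = "app easy app hotel easy app".split()

# absolute counts per word
counts = Counter(tokens)

# count of the most frequent word, used as the scaling factor
top_count = counts.most_common(1)[0][1]

# relative frequency of the (up to) 10 most common words
relative = {word: n / top_count for word, n in counts.most_common(10)}

for rank, (word, freq) in enumerate(relative.items(), start=1):
    print(rank, "-", word, ":", round(freq, 4))
```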

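It is better to visualize the negative and positive reviews separately to get a better understanding of them. In the following code, we build one word cloud for the negative and one for the positive English reviews, after adding a few frequent but uninformative words (e.g. "Agoda", "app", "hotel") to the stopword list: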

# filters
language_chosen = 'English'
reviews_in = data['language']== language_chosen
rev_type_neg = data['review_type']=='negative'
rev_type_pos = data['review_type']=='positive'

# Combine negative and positive reviews into separate single strings
neg_reviews = '\n'.join(data[reviews_in & rev_type_neg]['content_no_emojis'].astype(str))
pos_reviews = '\n'.join(data[reviews_in & rev_type_pos]['content_no_emojis'].astype(str))

# note: len() of a string counts characters, not words
print (f"\nThere are a total of {len(neg_reviews):,} characters in the combination of negative {language_chosen} reviews.")
print (f"\nThere are a total of {len(pos_reviews):,} characters in the combination of positive {language_chosen} reviews.\n")


# define stop words
stop_words = set(STOPWORDS)

# update stopwords
stop_words.update(["Agoda", "app","hotel","book","booking","booked","will","said","alway"])


# generate a word cloud for negative reviews, excluding stop words
wordcloud_negative = WordCloud(width=800, height=400, 
                      random_state=21, 
                      max_font_size=110,
                      background_color='white', 
                      stopwords=stop_words).generate(neg_reviews)


# generate a word cloud for positive reviews, excluding stop words
wordcloud_positive = WordCloud(width=800, height=400, 
                      random_state=21, 
                      max_font_size=110,
                      background_color='white', 
                      stopwords=stop_words).generate(pos_reviews)


# plot the WordCloud images
fig, (ax1, ax2) = plt.subplots(1, 2,figsize=(24, 16))

ax1.axis('off')
ax1.set_title("Negative Reviews", fontsize=18)
ax1.imshow(wordcloud_negative)

ax2.axis('off')
ax2.set_title("Positive Reviews", fontsize=18)
ax2.imshow(wordcloud_positive)

# save the figure
# plt.savefig('agoda_reviews_pos_and_neg.jpeg', format="jpeg", dpi=400, bbox_inches='tight',facecolor='w')

# display the figure
plt.show()
There are a total of 3,502,861 characters in the combination of negative English reviews.
There are a total of 1,746,344 characters in the combination of positive English reviews.

Now we have a better understanding of what people are having problems with, e.g. rooms, refunds, and customer service, and of what they like the most about Agoda's app, e.g. easy to use, fast, and best prices.

Conclusion

In this exercise, we performed sentiment analysis and created a word cloud visualization for customer reviews using Python. We used customers' ratings to analyze the sentiment of each review, categorizing it as positive or negative. Additionally, we utilized the WordCloud library to generate a visual representation of the most frequently occurring words in the reviews, excluding common stop words.

The sentiment analysis provides a quick overview of the overall tone of customer feedback, helping to understand the general sentiment of the reviews. On the other hand, the word cloud offers a visual snapshot of the most prominent words used by customers, highlighting key themes and topics.

By combining sentiment analysis and word cloud visualization, businesses can gain valuable insights into customer sentiments and identify recurring themes in their reviews. These insights can inform decision-making processes, product improvements, and customer satisfaction strategies.

Sentiment analysis, also known as opinion mining, is the process of determining the sentiment expressed in a piece of text, such as positive, negative, or neutral. It is a natural language processing (NLP) task that involves analyzing and classifying subjective information in text data. Although NLP libraries such as NLTK or TextBlob provide tools for sentiment analysis, we extracted the sentiment from the score/rating information since it was available to us.

Word clouds are a great way to visualize the most frequently occurring words in customer reviews. To build them, we used the WordCloud library (installed with pip install wordcloud) and removed the stopwords (check the Stopwords section for more detail) using STOPWORDS provided by the wordcloud library.