Agoda Reviews - Part II - Review Analysis and WordClouds

In Part I of this project, we extracted Agoda's reviews, detected the language of each review, and pre-processed the entire dataset in preparation for Exploratory Data Analysis, Sentiment Analysis, and Visualization. Although it is not necessary to follow the steps in Part I to perform the analyses in this section, it is worth reviewing them to see how the dataset was built.

In this phase, we start our analyses by importing the data that we saved in Part I. First things first, though: here are the packages we will need for this section:

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from wordcloud import WordCloud, STOPWORDS

1) Exploratory Data Analysis

Let us import the data and check the first 5 rows:

data = pd.read_csv('./agoda_reviews_231019_with_lang_unknown_dropped.csv')
data.head()

Time to check the overall information about the dataset using the pandas .info() method.

data.info()

.info() shows us that there are 12 columns with object and int64 datatypes, 74306 rows, i.e. records, and the dataset uses 6.8MB memory. The columns content, replyContent, repliedAt, appVersion, and content with emojis have NULL values.
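
If we only care about the missing values, a per-column count is easier to read than the full .info() output. The snippet below is a minimal sketch on a made-up toy dataframe (the values are hypothetical; only some column names mirror the ones mentioned above):

```python
import pandas as pd

# Toy stand-in for the reviews table; values are made up for illustration
toy = pd.DataFrame({
    "content": ["great app", None, "slow checkout"],
    "score": [5, 1, 2],
    "replyContent": [None, "Sorry to hear that", None],
})

# Count missing values per column and keep only the columns that have any
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```

Running the same two lines on data instead of toy should list the columns with NULL values together with how many values are missing in each.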

We'll now explore our data to gain insights and identify trends. Since we have already covered the number of reviews per language in Part I, we begin by displaying the score/rating distribution and then check the distribution of positive vs negative reviews:

a) Number of Reviews per Rating

Let us create a summary table called scores and gather the overall information in this dataframe:

scores = (
    data['score']
    .value_counts()
    .reset_index()
    .sort_values(by='score',ascending=False)
)
scores.reset_index(drop=True, inplace=True)
scores

The code snippet above gives us the number of reviews per rating. According to our calculation, there are 46820 5-Star reviews, 8107 4-Star reviews, and so on. Let's plot the results:

# create a bar plot
ax = scores.plot.bar(x='score',
                y='count', 
                rot=0, 
                color='cyan')

# set axes and title parameters
ax.set_title('Rating Distribution')
ax.set_ylabel('Number of Reviews')
ax.set_ylim(0,80000)

# annotate the bars
for x,y in scores["count"].items():
    plt.annotate(text = y, 
                 xy = (x,y), 
                 textcoords="offset points", 
                 xytext=(1,2), 
                 ha="center")

Although we have the raw counts, the relative frequency distribution, i.e. the percentage of each rating group, is a better representation:

scores['pct'] = round(scores['count']/scores['count'].sum()*100,2)
scores

We now know that 5-Star reviews account for ~63% of all reviews.
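
As a side note, pandas can compute relative frequencies directly through the normalize argument of value_counts. The following is a small sketch on made-up ratings (toy_scores is a hypothetical series, not our dataset):

```python
import pandas as pd

# Made-up ratings for illustration
toy_scores = pd.Series([5, 5, 5, 4, 1, 5, 3, 5], name="score")

pct = (
    toy_scores
    .value_counts(normalize=True)   # relative frequencies that sum to 1
    .mul(100)                       # convert to percentages
    .round(2)
    .sort_index(ascending=False)    # highest rating first, like our summary table
)
print(pct)
```

This gives the same kind of percentage column as above without first materializing the counts.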

b) Sentiment Analysis: Negative and Positive Reviews

Sentiment analysis, also known as opinion mining, is the process of determining the sentiment expressed in a piece of text, such as positive, negative, or neutral. It is a natural language processing (NLP) task that involves analyzing and classifying subjective information in text data. Though there are NLP libraries such as NLTK or TextBlob that provide us with tools for sentiment analysis, we will be using the score/rating information for this project since it is available to us.

We will add "Negative" and "Positive" tags to the reviews using the following logic:

Reviews with a rating of 4 or 5 -> considered "Positive"
Reviews with a rating of 1, 2, or 3 -> considered "Negative"

Let us create a binary column called review_type for "positive" or "negative" tags based on our methodology above:

data['review_type'] = np.where(data['score']<=3, 'negative', 'positive')
data.head()

Now add this info to our summary table, scores:

scores['review_type'] = np.where(scores['score']<=3, 'negative', 'positive')
scores

Using the pandas groupby function, we can get the overall distribution of positive vs negative reviews:

pos_neg_dist = scores.groupby('review_type').agg({'count':'sum','pct':'sum'})
pos_neg_dist

According to our analysis, Agoda's app has 54927 positive and 19379 negative reviews, which correspond to 73.92% and 26.08%, respectively. The percentage of negative reviews seems high and requires further analysis, which we will do in the Sentiment Analysis section. Before that, let us look at the positive vs negative review distribution for the 10 languages with the most reviews. Also keep in mind that we have treated "rating 3" reviews as negative. This group accounts for ~2.7% of all reviews, and one can argue that it belongs with the positive reviews. Even then, the negative review percentage is 23.38%.

Positive vs Negative Reviews with Rating 3 as Positive

scores['review_type_2'] = np.where(scores['score']<=2, 'negative', 'positive')
scores.groupby('review_type_2').agg({'count':'sum','pct':'sum'})
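
To make it easy to test where the cut-off should sit, the threshold can be turned into a parameter. The helper below is only a sketch: negative_share and negative_max are hypothetical names, and the ratings are made up.

```python
import numpy as np
import pandas as pd

def negative_share(scores: pd.Series, negative_max: int) -> float:
    """Return the percentage of reviews whose score is <= negative_max."""
    labels = np.where(scores <= negative_max, "negative", "positive")
    return round(float((labels == "negative").mean() * 100), 2)

# Made-up ratings for illustration
toy_scores = pd.Series([5, 5, 4, 3, 3, 2, 1, 5, 4, 1])

share_3_negative = negative_share(toy_scores, negative_max=3)  # rating 3 counted as negative
share_3_positive = negative_share(toy_scores, negative_max=2)  # rating 3 counted as positive
print(share_3_negative, share_3_positive)
```

Calling the same function on data['score'] with both thresholds should reproduce the two negative percentages discussed above.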

c) Sentiment Analysis over the Top 10 Languages by Number of Reviews

Using the pandas value_counts method, we extract the 10 languages with the most reviews:

top10 = data['language'].value_counts().reset_index()[:10]
top10

Let's now plot this data using the pandas plot function. Be aware that the number of English reviews is two orders of magnitude larger than that of the other languages; therefore, we use a logarithmic scale for plotting:

top10.plot.barh(y='count',
                x='language', 
                grid=True,   # grid expects a boolean
                title='Top 10 Languages with Highest Number of Reviews',
                xlabel='Number of Reviews',
                ylabel='Languages',
                logx=True)
                
plt.gca().invert_yaxis()

Let's add positive and negative review type information for the Top 10 languages.

# create a dataframe for Top 10 languages
top10_review_type_count=(data[data['language'].isin(list(top10['language']))]
 .groupby(['language','review_type'])
 .agg({'reviewId':'count'})
 .reset_index()
)

# update column name
top10_review_type_count.rename(columns={'reviewId':'count'}, inplace=True)

# calculate total number of reviews for each language
total_counts = top10_review_type_count.groupby('language')['count'].sum().reset_index()

# join tables top10_review_type_count and total_counts
merged = top10_review_type_count.merge(total_counts, 
                                       on='language', 
                                       suffixes=('', '_total'))

# sort the dataframe based on total number of reviews in descending order
sorted_df = merged.sort_values(by=['count_total','language'], ascending=[False, False]).reset_index(drop=True)


# add percentage column and display the dataframe
sorted_df['percentage']=round(sorted_df['count']/sorted_df['count_total']*100,2)
sorted_df

Note that, in order to keep the integrity of the data, we also used the language column when sorting the merged dataframe, so that the rows of each language stay together.
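
Here is why the second sort key matters: when two languages happen to have the same total, sorting on count_total alone does not guarantee that the negative and positive rows of each language end up next to each other. The sketch below uses made-up counts in which the hypothetical languages "A" and "B" share a total of 30:

```python
import pandas as pd

# Made-up counts: languages "A" and "B" share the same total (30)
merged = pd.DataFrame({
    "language":    ["A", "B", "A", "B", "C", "C"],
    "review_type": ["negative", "negative", "positive", "positive", "negative", "positive"],
    "count":       [10, 5, 20, 25, 1, 2],
    "count_total": [30, 30, 30, 30, 3, 3],
})

# The language column breaks ties between equal totals
sorted_df = (
    merged
    .sort_values(by=["count_total", "language"], ascending=[False, False])
    .reset_index(drop=True)
)
print(sorted_df)
```

With the tie-breaker in place, both "B" rows come first, followed by both "A" rows and then the "C" rows.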

The table above provides us with the following:

total number of reviews per language
number of negative and positive reviews per language
percentage of negative and positive reviews per language

Adding percentage information to our dataset shows that, although their total number of reviews is not high, Arabic-speaking customers have the highest percentage of negative reviews. The other two languages with a high negative review percentage are English and Thai. Given how many English reviews there are, it is no surprise that English-speaking customers have the highest absolute number of reviews in the table.
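
The per-language percentages can also be obtained in a single step with pd.crosstab and its normalize argument. This is an alternative sketch on made-up rows, not the approach used above:

```python
import pandas as pd

# Made-up reviews for illustration
toy = pd.DataFrame({
    "language":    ["English", "English", "English", "English",
                    "Thai", "Thai", "Arabic", "Arabic"],
    "review_type": ["positive", "positive", "positive", "negative",
                    "positive", "negative", "negative", "negative"],
})

# Rows: languages, columns: review types, values: row-wise percentages
pct_table = (
    pd.crosstab(toy["language"], toy["review_type"], normalize="index")
    .mul(100)
    .round(2)
)
print(pct_table)
```

The result is a wide table (one row per language), which is handy for reading percentages, while the long format built above is more convenient for the bar plot that follows.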

It is good practice to plot the data when a table has more than 10 rows; however, keep in mind that this threshold is completely arbitrary!

# add a column to df with language and review_type to be used in the plot
sorted_df['lang_rt']=sorted_df['language']+' - ' +sorted_df['review_type']

# add a column to df with colors based on review_type
colors = {'positive': '#0dd388', 'negative': '#FE6F5E'}
sorted_df['color'] = sorted_df['review_type'].map(colors)

# create a bar plot with alternating colors
ax = sorted_df.plot.bar(x='lang_rt', 
                        y='count', 
                        color=sorted_df['color'], 
                        logy=True, 
                        figsize=(10, 6))

# set labels and title
ax.set_ylabel('Number of Reviews (log scale)')
ax.set_xlabel('Language and Review Type')
ax.set_title('Review Counts \n(Positives in Green, Negatives in Red)')
ax.get_legend().remove()

# add data labels to the graph
for x,y in sorted_df["count"].items():
    plt.annotate(y, (x,y), textcoords="offset points", xytext=(2.5, 5), ha="center")

# display plot
plt.show()

Now it is time to visually inspect what customers talk about in their reviews.

2) WordClouds

Word clouds are a great way to visualize the most frequently occurring words in customer reviews. To do this, we will use the WordCloud library. Make sure to install the required package first by running:

pip install wordcloud

Before visualizing the most common words, we will:

first, combine all reviews into a single text string
then, remove the stopwords (check the Stopwords section for more detail) using STOPWORDS provided by the wordcloud library.

Keep in mind that we can use stopword sets provided by other NLP packages such as NLTK, or combine multiple lists and create our own stopword list.
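
Combining several stopword lists boils down to a set union. Below is a minimal sketch; build_stopwords is a hypothetical helper, and list_a and list_b are made-up stand-ins for real sources such as wordcloud's STOPWORDS or NLTK's English stopword list:

```python
def build_stopwords(*sources, extra=()):
    """Union several stopword collections into one lowercase set."""
    combined = set()
    for source in sources:
        combined.update(word.lower() for word in source)
    combined.update(word.lower() for word in extra)
    return combined

# Hypothetical stand-ins for real stopword sources
list_a = {"the", "and", "is"}
list_b = ["The", "of", "is"]

# extra holds our own domain-specific additions
stop_words = build_stopwords(list_a, list_b, extra=["Agoda", "app"])
print(sorted(stop_words))
```

Lowercasing everything keeps duplicates such as "The" and "the" from ending up in the set twice.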

# filter only English reviews
language_chosen = 'English'
reviews = data['language']== language_chosen

# Combine all reviews into a single string
all_reviews = '\n'.join(data[reviews]['content_no_emojis'].astype(str))
# len() on a string counts characters, not words
print(f"There are a total of {len(all_reviews):,} characters in the combination of all {language_chosen} reviews.")

# define stop words
stopwords = set(STOPWORDS)

# generate a word cloud excluding stop words
wordcloud = WordCloud(width=800, height=400, 
                      random_state=21, 
                      max_font_size=110,
                      background_color='white', 
                      stopwords=stopwords).generate(all_reviews)

# plot the WordCloud image
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')

# store to file
# plt.savefig("./agoda_reviews_all.jpeg", format="jpeg",dpi=400)

plt.show()

There are a total of 5,249,206 characters in the combination of all English reviews.
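
Note that Python's len() on a string returns the number of characters, so the figure printed above is a character count. A rough word count needs a tokenization step; splitting on whitespace is the simplest option. Here is a small sketch with made-up reviews:

```python
# Made-up reviews for illustration
sample_reviews = ["Easy to use", "Booking failed twice", "Great prices"]

all_text = "\n".join(sample_reviews)

n_chars = len(all_text)          # counts every character, including the newlines
n_words = len(all_text.split())  # splits on any whitespace, so this counts tokens
print(f"{n_chars:,} characters, {n_words:,} words")
```

Applying len(all_reviews.split()) to our combined string would give an approximate number of words in the English reviews.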

Here is our wordcloud:

We can also check the most frequent words with the words_ attribute of our cloud, which is a dictionary that maps each word to its relative frequency. Here is the list of the Top 10 most frequent words:

# display the 10 most frequent words (words_ is ordered by frequency)
for rank, (word, freq) in enumerate(list(wordcloud.words_.items())[:10], start=1):
    print(rank, "-", word, ":", freq)

1 - app : 1.0
2 - hotel : 0.8624198186241981
3 - booking : 0.8534616235346162
4 - Agoda : 0.8202831232028313
5 - easy : 0.6883432868834328
6 - use : 0.6575978765759788
7 - book : 0.3795620437956204
8 - room : 0.3522450785224508
9 - booked : 0.31176730811767306
10 - good : 0.2814642778146428

According to our list, "app" is the most frequent word, which is not surprising given that the reviews were extracted from the Google Play Store, i.e. they are reviews of Agoda's app.
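
The values in the list are relative: each count is divided by the count of the most frequent word, which is why the top word scores 1.0. The sketch below mimics that idea with collections.Counter on made-up reviews. It uses a simple regular expression and a tiny hypothetical stopword set, so it is not a reproduction of WordCloud's own tokenizer:

```python
from collections import Counter
import re

# Made-up reviews and a tiny, hypothetical stopword set
reviews = [
    "Easy to use app",
    "The app is easy and fast",
    "App crashed during booking",
]
stop_words = {"to", "the", "is", "and", "during"}

# Lowercase, tokenize, and drop stopwords
tokens = [
    word
    for review in reviews
    for word in re.findall(r"[a-z']+", review.lower())
    if word not in stop_words
]

counts = Counter(tokens)
top_count = counts.most_common(1)[0][1]

# Normalize by the count of the most frequent word
relative = {word: n / top_count for word, n in counts.most_common(3)}
print(relative)
```

In this toy example "app" appears three times and "easy" twice, so "easy" gets a relative frequency of about 0.67.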

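It is better to visualize the negative and positive reviews separately to get a better understanding of them. In the following code, we filter the English reviews by review type, extend the stopword list with a few frequent but uninformative words, and generate one word cloud per group: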

# filters
language_chosen = 'English'
reviews_in = data['language']== language_chosen
rev_type_neg = data['review_type']=='negative'
rev_type_pos = data['review_type']=='positive'

# Combine negative and positive reviews into separate single strings
neg_reviews = '\n'.join(data[reviews_in & rev_type_neg]['content_no_emojis'].astype(str))
pos_reviews = '\n'.join(data[reviews_in & rev_type_pos]['content_no_emojis'].astype(str))

# len() on a string counts characters, not words
print(f"\nThere are a total of {len(neg_reviews):,} characters in the combination of negative {language_chosen} reviews.")
print(f"\nThere are a total of {len(pos_reviews):,} characters in the combination of positive {language_chosen} reviews.\n")


# define stop words
stop_words = set(STOPWORDS)

# update stopwords
stop_words.update(["Agoda", "app","hotel","book","booking","booked","will","said","alway"])


# generate a word cloud for the negative reviews, excluding stop words
wordcloud_negative = WordCloud(width=800, height=400, 
                      random_state=21, 
                      max_font_size=110,
                      background_color='white', 
                      stopwords=stop_words).generate(neg_reviews)


# generate a word cloud for the positive reviews, excluding stop words
wordcloud_positive = WordCloud(width=800, height=400, 
                      random_state=21, 
                      max_font_size=110,
                      background_color='white', 
                      stopwords=stop_words).generate(pos_reviews)


# plot the WordCloud images
fig, (ax1, ax2) = plt.subplots(1, 2,figsize=(24, 16))

ax1.axis('off')
ax1.set_title("Negative Reviews", fontsize=18)
ax1.imshow(wordcloud_negative)

ax2.axis('off')
ax2.set_title("Positive Reviews", fontsize=18)
ax2.imshow(wordcloud_positive)

# save the figure
# plt.savefig('agoda_reviews_pos_and_neg.jpeg', format="jpeg", dpi=400, bbox_inches='tight',facecolor='w')

# display the figure
plt.show()

There are a total of 3,502,861 characters in the combination of negative English reviews.
There are a total of 1,746,344 characters in the combination of positive English reviews.

Now we have a better understanding of what people are having problems with, e.g. rooms, refunds, and customer service, and of what they like the most about Agoda's app, e.g. that it is easy to use, fast, and offers the best prices.

Conclusion

In this exercise, we performed review analysis and created word cloud visualizations for customer reviews using Python. We used customers' ratings to determine the sentiment of each review, categorizing it as positive or negative. Additionally, we utilized the WordCloud library to generate a visual representation of the most frequently occurring words in the reviews, excluding common stop words.

The sentiment analysis provides a quick overview of the overall tone of customer feedback, helping to understand the general sentiment of the reviews. On the other hand, the word cloud offers a visual snapshot of the most prominent words used by customers, highlighting key themes and topics.

By combining sentiment analysis and word cloud visualization, businesses can gain valuable insights into customer sentiments and identify recurring themes in their reviews. These insights can inform decision-making processes, product improvements, and customer satisfaction strategies.
