# Agoda Reviews - Part II - Sentiment Analysis and WordClouds

In [Part I](/ds-hub/machine-learning/natural-language-processing/sentiment-analysis/agoda-reviews-part-i-scraping-reviews-detecting-languages-and-preprocessing.md) of this project, we extracted Agoda's reviews, detected the language of each review, and pre-processed the entire dataset for Exploratory Data Analysis, Sentiment Analysis, and Visualization. Although it is not required to follow the steps in Part I to perform the analyses in this section, it is worth checking them for the continuity of the project.

In this phase, we start our analyses by importing the data that we saved in Part I. First things first, here are the packages we will need for this section:

```python
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from wordcloud import WordCloud, STOPWORDS
```

## 1) Exploratory Data Analysis

Let us import the data and check the first 5 rows:

```python
data = pd.read_csv('./agoda_reviews_231019_with_lang_unknown_dropped.csv')
data.head()
```

<figure><img src="/files/2oYEzw02WJwOC5Suo0Fi" alt=""><figcaption></figcaption></figure>

Time to check the overall information about the dataset by using pandas `.info()` method.

```python
data.info()
```

<figure><img src="/files/PYtSzkIG4CFD1wZXiQtf" alt=""><figcaption></figcaption></figure>

`.info()` shows us that there are 12 columns with object and int64 datatypes, 74306 rows, i.e. records, and the dataset uses 6.8MB memory. The columns content, replyContent, repliedAt, appVersion, and content with emojis have NULL values.
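To see exactly how many values are missing in each column, the null counts can be tabulated directly. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the Agoda data); on the real dataset the equivalent call would be `data.isna().sum()`.

```python
import pandas as pd

# toy stand-in for the review data (values are made up for illustration)
toy = pd.DataFrame({
    "content": ["great app", None, "slow refund"],
    "replyContent": [None, None, "Sorry to hear that"],
    "score": [5, 4, 1],
})

# number of missing values per column, largest first
null_counts = toy.isna().sum().sort_values(ascending=False)
print(null_counts)
```

Sorting the result puts the sparsest columns at the top, which is handy when a dataset has many columns.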

We'll now explore our data to gain insights and identify trends. Since we have already covered the number of reviews per language in Part I, we begin by displaying the score/rating distribution and then check the distribution of positive vs negative reviews:

### a) Number of Reviews per Rating

Let us create a summary table called `scores` and gather the overall information in this dataframe:

```python
scores = (
    data['score']
    .value_counts()
    .reset_index()
    .sort_values(by='score',ascending=False)
)
scores.reset_index(drop=True, inplace=True)
scores
```

<figure><img src="/files/eNSZNppJ7PsCawZZZWvc" alt="" width="118"><figcaption></figcaption></figure>

The code snippet above gives us the number of reviews per rating. According to our calculation, there are 46820 5-Star reviews, 8107 4-Star reviews, and so on. Let's plot the results:

```python
# create a bar plot
ax = scores.plot.bar(x='score',
                y='count', 
                rot=0, 
                color='cyan')

# set axes and title parameters
ax.set_title('Rating Distribution')
ax.set_ylabel('Number of Reviews')
ax.set_ylim(0,80000)

# annotate the bars
for x,y in scores["count"].items():
    plt.annotate(text = y, 
                 xy = (x,y), 
                 textcoords="offset points", 
                 xytext=(1,2), 
                 ha="center")
```

<figure><img src="/files/afTum9CB1jTZH8EFCCx8" alt="" width="563"><figcaption></figcaption></figure>

Although we have the counts, a better representation is the relative frequency distribution, i.e. the percentage, of each rating group:

```python
scores['pct'] = round(scores['count']/scores['count'].sum()*100,2)
scores
```

<figure><img src="/files/wnlhORwWjzN4Dwp1Kuyj" alt="" width="144"><figcaption></figcaption></figure>

We now know that 5-Star reviews account for \~63% of all reviews.
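The same kind of percentage table can also be obtained in one step with `value_counts(normalize=True)`, which returns relative frequencies instead of counts. A minimal sketch on made-up ratings (`toy_scores` is illustrative; on the real data the series would be `data['score']`):

```python
import pandas as pd

# toy ratings (made up); on the real data this would be data['score']
toy_scores = pd.Series([5, 5, 5, 4, 3, 1, 1, 5])

# relative frequency of each rating, in percent, ordered from 5 stars down
pct = (
    toy_scores
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
    .sort_index(ascending=False)
)
print(pct)
```

Sorting by the index rather than by the values keeps the ratings in star order regardless of how frequent each one is.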

### b) Sentiment Analysis: Negative and Positive Reviews

Sentiment analysis, also known as opinion mining, is the process of determining the sentiment expressed in a piece of text, such as positive, negative, or neutral. It is a natural language processing (NLP) task that involves analyzing and classifying subjective information in text data. Though there are NLP libraries such as NLTK or [TextBlob](https://textblob.readthedocs.io/en/dev/) that provide us with tools for sentiment analysis, we will be extracting the sentiment from the score/rating information, since it is available to us.

The "Negative" and "Positive" tags will be added to the reviews using the following logic:

* Reviews with a rating of 4 or 5 -> will be considered "Positive"
* Reviews with a rating of 1, 2, or 3 -> will be considered "Negative"

Let us create a binary column called `review_type` for "positive" or "negative" tags based on our methodology above:

```python
data['review_type'] = np.where(data['score']<=3, 'negative', 'positive')
data.head()
```

<figure><img src="/files/Wf83yWuYzVYHDo8hfY8i" alt=""><figcaption></figcaption></figure>

Now add this info to our summary table, `scores`:

```python
scores['review_type'] = np.where(scores['score']<=3, 'negative', 'positive')
scores
```

<figure><img src="/files/NHhOolrHcD85VEUfWETd" alt=""><figcaption></figcaption></figure>

Using the pandas `groupby` function, we can get the overall distribution of positive vs negative reviews:

```python
pos_neg_dist = scores.groupby('review_type').agg({'count':'sum','pct':'sum'})
pos_neg_dist
```

<figure><img src="/files/0XZPYi6S3amkw7ngCQJT" alt=""><figcaption></figcaption></figure>

According to our analysis, Agoda's app has 54927 positive and 19379 negative reviews, which correspond to 73.92% and 26.08%, respectively. The percentage of negative reviews seems high and requires further analysis, which we will do in the next section; before that, let us visualize the review distribution for the 10 languages with the most reviews. Also, keep in mind that we have treated rating-3 reviews as negative. They account for \~2.7% of all reviews, and one can argue that this group belongs with the positive reviews. Even then, the negative review percentage is 23.38%.

<details>

<summary>Positive vs Negative Reviews with Rating 3 as Positive</summary>

```python
scores['review_type_2'] = np.where(scores['score']<=2, 'negative', 'positive')
scores.groupby('review_type_2').agg({'count':'sum','pct':'sum'})
```

<img src="/files/fIbdaYh4dMo2rCMLTyYT" alt="" data-size="original">

</details>
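A third option, not used in this article, is to keep rating 3 as its own "neutral" class instead of forcing it into either group. A minimal sketch with `pd.cut` on made-up scores (the column name `review_type_3`, the label names, and the bin edges are assumptions chosen for illustration):

```python
import pandas as pd

# toy scores (made up); on the real data this would be data['score']
toy = pd.DataFrame({"score": [1, 2, 3, 4, 5, 3]})

# bins are right-inclusive: (0, 2] -> negative, (2, 3] -> neutral, (3, 5] -> positive
toy["review_type_3"] = pd.cut(
    toy["score"],
    bins=[0, 2, 3, 5],
    labels=["negative", "neutral", "positive"],
)
print(toy["review_type_3"].value_counts())
```

Whether a separate neutral class is useful depends on the question being asked; for the binary split used in the rest of this article, `np.where` is sufficient.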

### c) Sentiment analysis over Top 10 Languages with the Most Number of Reviews

Using the pandas `value_counts` method, we will extract the 10 languages with the most reviews:

```python
top10 = data['language'].value_counts().reset_index()[:10]
top10
```

<figure><img src="/files/pUNnPb59mQczEdWwWkAj" alt=""><figcaption></figcaption></figure>

Let's now plot this data using the pandas `plot` function. Be aware that the number of English reviews is two orders of magnitude larger than that of the other languages; therefore, we will use a logarithmic scale for plotting:

```python
top10.plot.barh(y='count',
                x='language', 
                grid=True,
                title='Top 10 Languages with Highest Number of Reviews',
                xlabel='Number of Reviews',
                ylabel='Languages',
                logx=True)
                
plt.gca().invert_yaxis()
```

<figure><img src="/files/73Ah1v6fEAv9gJXo27u6" alt=""><figcaption></figcaption></figure>

Let's add positive and negative review type information for the Top 10 languages.

```python
# create a dataframe for Top 10 languages
top10_review_type_count=(data[data['language'].isin(list(top10['language']))]
 .groupby(['language','review_type'])
 .agg({'reviewId':'count'})
 .reset_index()
)

# update column name
top10_review_type_count.rename(columns={'reviewId':'count'}, inplace=True)

# calculate total number of reviews for each language
total_counts = top10_review_type_count.groupby('language')['count'].sum().reset_index()

# join tables top10_review_type_count and total_counts
merged = top10_review_type_count.merge(total_counts, 
                                       on='language', 
                                       suffixes=('', '_total'))

# sort the dataframe based on total number of reviews in descending order
sorted_df = merged.sort_values(by=['count_total','language'], ascending=[False, False]).reset_index(drop=True)


# add percentage column and display the dataframe
sorted_df['percentage']=round(sorted_df['count']/sorted_df['count_total']*100,2)
sorted_df
```

{% hint style="info" %}
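Note that, in order to keep the integrity of the data, we also used the `language` column when sorting the dataframe `merged`; this keeps the rows of each language next to each other even if two languages have the same total.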
{% endhint %}

<figure><img src="/files/2O8qRIZsBKEIYcbe1Ztk" alt=""><figcaption></figcaption></figure>

The table above provides us with:

* total number of reviews per language
* number of negative and positive reviews per language
* percentage of negative and positive reviews per language

Adding the percentage information to our dataset enabled us to see that, although their total number of reviews is not high, Arabic-speaking customers have the highest percentage of negative reviews. The two other languages with higher negative review percentages are English and Thai. It is also no surprise that English-speaking customers have the highest number of reviews among all languages!

It is good practice to plot the data when a table has more than 10 rows; however, keep in mind that this number is completely arbitrary!
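For readers who only need the percentages, a more compact route is `pd.crosstab` with `normalize='index'`, which computes row-wise shares directly. The sketch below runs on a made-up frame (`toy` and its values are illustrative); on the real data the two inputs would be `data['language']` and `data['review_type']`.

```python
import pandas as pd

# toy stand-in for the `language` and `review_type` columns (made-up values)
toy = pd.DataFrame({
    "language": ["English", "English", "English", "English", "Thai", "Thai"],
    "review_type": ["positive", "positive", "positive", "negative", "positive", "negative"],
})

# row-wise shares: the reviews of each language sum to 100%
share = (
    pd.crosstab(toy["language"], toy["review_type"], normalize="index")
    .mul(100)
    .round(2)
)
print(share)
```

The wide layout (one row per language, one column per review type) is easier to scan than the long table, at the cost of dropping the absolute counts.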

```python
# add a column to df with language and review_type to be used in the plot
sorted_df['lang_rt']=sorted_df['language']+' - ' +sorted_df['review_type']

# add a column to df with colors based on review_type
colors = {'positive': '#0dd388', 'negative': '#FE6F5E'}
sorted_df['color'] = sorted_df['review_type'].map(colors)

# create a bar plot with alternating colors
ax = sorted_df.plot.bar(x='lang_rt', 
                        y='count', 
                        color=sorted_df['color'], 
                        logy=True, 
                        figsize=(10, 6))

# set labels and title
ax.set_ylabel('Number of Reviews (log scale)')
ax.set_xlabel('Language and Review Type')
ax.set_title('Review Counts \n(Positives in Green, Negatives in Red)')
ax.get_legend().remove()

# add data labels to the graph
for x,y in sorted_df["count"].items():
    plt.annotate(y, (x,y), textcoords="offset points", xytext=(2.5, 5), ha="center")

# display plot
plt.show()
```

<figure><img src="/files/brrudB0RFU5aUvGvYzN2" alt=""><figcaption></figcaption></figure>

Now, it is time to visually inspect what customers are talking about in their reviews.

## 2) WordClouds

Word clouds are a great way to visualize the most frequently occurring words in customer reviews. To do this, we will use the [`WordCloud`](https://amueller.github.io/word_cloud/) library. Make sure to install the required package first by running:

```bash
pip install wordcloud
```

Before visualizing the most common words, we will:

1. first combine all reviews into one single text string
2. then, remove the stopwords (check the [Stopwords section](/ds-hub/machine-learning/natural-language-processing/stopwords.md) for more detail) using `STOPWORDS` provided by the `wordcloud` library.

{% hint style="info" %}
Keep in mind that we can use stopword sets provided by other NLP packages such as NLTK, or combine multiple lists and create our own stopword list.
{% endhint %}

```python
# filter only English reviews
language_chosen = 'English'
reviews = data['language']== language_chosen

# Combine all reviews into a single string
all_reviews = '\n'.join(data[reviews]['content_no_emojis'].astype(str))
# len() on a string counts characters, not words
print(f"There are a total of {len(all_reviews):,} characters in the combination of all {language_chosen} reviews.")

# define stop words
stopwords = set(STOPWORDS)

# generate a word cloud excluding stop words
wordcloud = WordCloud(width=800, height=400, 
                      random_state=21, 
                      max_font_size=110,
                      background_color='white', 
                      stopwords=stopwords).generate(all_reviews)

# plot the WordCloud image
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')

# store to file
# plt.savefig("./agoda_reviews_all.jpeg", format="jpeg",dpi=400)

plt.show()
```

```
There are a total of 5,249,206 characters in the combination of all English reviews.
```
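Keep in mind that `len()` applied to a string returns its number of characters, so the figure printed above measures the length of the combined text in characters. If a word count is wanted instead, a simple whitespace split is one option. A minimal sketch on two made-up reviews (`toy_reviews` is illustrative):

```python
# two made-up reviews, joined with a newline like the combined review text
toy_reviews = ["easy to use", "refund never arrived"]
joined = "\n".join(toy_reviews)

n_chars = len(joined)          # characters, including spaces and the newline
n_words = len(joined.split())  # whitespace-separated tokens

print(f"{n_chars:,} characters, {n_words:,} words")
```

Splitting on whitespace is only a rough word count; punctuation stays attached to the neighbouring token.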

Here is our wordcloud:

<figure><img src="/files/3AFGEjXN8GA0KmSxoSMV" alt=""><figcaption></figcaption></figure>

We can also check the most frequent words with the `words_` attribute of our cloud, which is a dictionary mapping each word to its frequency relative to the most frequent word. Here is the list of the Top 10 most frequent words:

```python
# display the 10 most frequent words with their relative frequencies
for rank, (word, freq) in enumerate(list(wordcloud.words_.items())[:10], start=1):
    print(rank, "-", word, ":", freq)
```

```
1 - app : 1.0
2 - hotel : 0.8624198186241981
3 - booking : 0.8534616235346162
4 - Agoda : 0.8202831232028313
5 - easy : 0.6883432868834328
6 - use : 0.6575978765759788
7 - book : 0.3795620437956204
8 - room : 0.3522450785224508
9 - booked : 0.31176730811767306
10 - good : 0.2814642778146428
```

According to our list, the word "app" is the most frequent word, which is not surprising given that the reviews were extracted from the Google Play Store, i.e. they are reviews of Agoda's app.

It is better to visualize negative and positive reviews separately to get a better understanding of them. In the following code, we split the English reviews by review type, extend the stopword list with frequent but uninformative words such as "Agoda" and "app", and generate one word cloud per group:
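The scores in the list above are relative frequencies: each count is divided by the count of the most frequent entry, which is why the top word scores 1.0. The sketch below mimics that idea with the standard library only. It is not WordCloud's actual implementation (which also handles collocations and plural normalization), and the sample text and the tiny stopword set are made up for illustration.

```python
import re
from collections import Counter

text = "Easy app to use. The app is fast and the hotel booking is easy."
toy_stopwords = {"to", "the", "is", "and"}  # illustrative, not wordcloud's STOPWORDS

# lowercase, split into alphabetic tokens, and drop stopwords
tokens = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in toy_stopwords]
counts = Counter(tokens)

# scale every count by the largest one, so the top word gets 1.0
top = max(counts.values())
relative = {word: n / top for word, n in counts.most_common()}
print(relative)
```

Because the values are scaled by the maximum rather than by the total, they do not sum to 1 and should be read as "how frequent compared to the top word".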

```python
# filters
language_chosen = 'English'
reviews_in = data['language']== language_chosen
rev_type_neg = data['review_type']=='negative'
rev_type_pos = data['review_type']=='positive'

# Combine negative and positive reviews into separate single strings
neg_reviews = '\n'.join(data[reviews_in & rev_type_neg]['content_no_emojis'].astype(str))
pos_reviews = '\n'.join(data[reviews_in & rev_type_pos]['content_no_emojis'].astype(str))

# len() on a string counts characters, not words
print(f"\nThere are a total of {len(neg_reviews):,} characters in the combination of negative {language_chosen} reviews.")
print(f"\nThere are a total of {len(pos_reviews):,} characters in the combination of positive {language_chosen} reviews.\n")


# define stop words
stop_words = set(STOPWORDS)

# update stopwords with frequent but uninformative words
stop_words.update(["Agoda", "app","hotel","book","booking","booked","will","said","alway"])


# generate a word cloud for the negative reviews, excluding stop words
wordcloud_negative = WordCloud(width=800, height=400, 
                      random_state=21, 
                      max_font_size=110,
                      background_color='white', 
                      stopwords=stop_words).generate(neg_reviews)


# generate a word cloud for the positive reviews, excluding stop words
wordcloud_positive = WordCloud(width=800, height=400, 
                      random_state=21, 
                      max_font_size=110,
                      background_color='white', 
                      stopwords=stop_words).generate(pos_reviews)


# plot the WordCloud images side by side
fig, (ax1, ax2) = plt.subplots(1, 2,figsize=(24, 16))

ax1.axis('off')
ax1.set_title("Negative Reviews", fontsize=18)
ax1.imshow(wordcloud_negative)

ax2.axis('off')
ax2.set_title("Positive Reviews", fontsize=18)
ax2.imshow(wordcloud_positive)

# save the figure
# plt.savefig('agoda_reviews_pos_and_neg.jpeg', format="jpeg", dpi=400, bbox_inches='tight',facecolor='w')

# display the figure
plt.show()
```

```
There are a total of 3,502,861 characters in the combination of negative English reviews.
There are a total of 1,746,344 characters in the combination of positive English reviews.
```

<figure><img src="/files/30wQDIMbCt4jjzdWPulg" alt=""><figcaption></figcaption></figure>

Now we have a better understanding of what people are having problems with, e.g. rooms, refunds, and customer service, and of what they like the most about Agoda's app, e.g. ease of use, speed, and best prices.

## Conclusion

In this exercise, we performed sentiment analysis and created word cloud visualizations for customer reviews using Python. We used customers' ratings to determine the sentiment of each review, categorizing them as positive or negative. Additionally, we utilized the WordCloud library to generate a visual representation of the most frequently occurring words in the reviews, excluding common stop words.

The sentiment analysis provides a quick overview of the overall tone of customer feedback, helping to understand the general sentiment of the reviews. On the other hand, the word cloud offers a visual snapshot of the most prominent words used by customers, highlighting key themes and topics.

By combining sentiment analysis and word cloud visualization, businesses can gain valuable insights into customer sentiments and identify recurring themes in their reviews. These insights can inform decision-making processes, product improvements, and customer satisfaction strategies.

