Agoda Reviews - Part II - Sentiment Analysis and WordClouds
In Part I of this project, we extracted Agoda's reviews, detected the language of each review, and pre-processed the entire dataset for Exploratory Data Analysis, Sentiment Analysis, and Visualization. Although you do not need to follow the steps in Part I to perform the analyses in this section, it is worth checking them to see how the project fits together.
In this phase, we start our analyses by importing the data that we saved in Part I. First things first, here are the packages we will need for this section:
Let us import the data and check the first 5 rows:
Time to check the overall information about the dataset by using the pandas .info() method.
.info() shows us that there are 12 columns with object and int64 datatypes and 74306 rows, i.e. records, and that the dataset uses 6.8MB of memory. The columns content, replyContent, repliedAt, appVersion, and content with emojis have NULL values.
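As a rough illustration of the loading and inspection steps above, the sketch below reads a tiny in-memory CSV instead of the real file saved in Part I. The rows are made up, only a few of the 12 columns are included, and the column name score for the rating is an assumption.

```python
import io
import pandas as pd

# Stand-in for the CSV saved in Part I; the rows are made up for illustration.
csv_text = """content,score,replyContent,appVersion,language
Great app,5,,10.1.0,en
Refund never arrived,1,We are sorry to hear that,,en
,4,,10.0.0,th
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows (up to 5)
df.info()          # dtypes, non-null counts, memory usage

# Number of missing values per column, keeping only columns that have any
null_counts = df.isna().sum()
print(null_counts[null_counts > 0])
```

With the real data, the same isna().sum() call is a quick way to confirm which columns contain NULL values.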
We'll now explore our data to gain insights and identify trends. Since we already covered the number of reviews per language in Part I, we begin by displaying the score/rating distribution and then check the distribution of positive vs. negative reviews:
Let us create a summary table called scores and gather the overall information in this dataframe:
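A minimal sketch of such a summary table is shown here, assuming the rating lives in a column named score. The ratings are made up and the column name count is an illustrative choice.

```python
import pandas as pd

# Made-up ratings; the real dataset has 74306 rows.
df = pd.DataFrame({"score": [5, 5, 5, 4, 1, 3, 5, 2, 1, 5]})

# One row per rating with the number of reviews that received it,
# ordered from 5 stars down to 1 star
scores = (
    df["score"]
    .value_counts()
    .sort_index(ascending=False)
    .rename_axis("score")
    .reset_index(name="count")
)
print(scores)
```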
The code snippet above gives us the number of reviews per rating. According to our calculation, there are 46820 5-Star reviews, 8107 4-Star reviews, and so on. Let's plot the results:
Although we have the counts, a better representation is the relative frequency distribution, i.e. the percentage, of each rating group:
We now know that 5-Star review counts correspond to ~63% of all reviews.
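The percentage can be checked by hand from the figures quoted so far. The sketch below uses only the two rating counts given in the text and the total row count reported by .info().

```python
import pandas as pd

total_reviews = 74306                      # row count reported by .info()
counts = pd.Series({5: 46820, 4: 8107})    # the two counts quoted in the text

# Relative frequency in percent, rounded to two decimals
percent = (counts / total_reviews * 100).round(2)
print(percent)
```

On the full score column, value_counts(normalize=True) would give the same proportions for every rating in one call.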
Sentiment analysis, also known as opinion mining, is the process of determining the sentiment expressed in a piece of text, such as positive, negative, or neutral. It is a natural language processing (NLP) task that involves analyzing and classifying subjective information in text data. Although NLP libraries such as NLTK or TextBlob provide tools for sentiment analysis, we will derive the sentiment from the score/rating information since it is available to us.
"Negative" and "Positive" tags will be added to the reviews using the following logic:
Reviews with a rating of 4 or 5 -> will be considered "Positive"
Reviews with a rating of 1, 2, or 3 -> will be considered "Negative"
Let us create a binary column called review_type for "positive" or "negative" tags based on our methodology above:
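One way to express this rule is sketched below. The helper add_review_type and its min_positive parameter are illustrative names, not part of the original code, and the column name score is an assumption.

```python
import numpy as np
import pandas as pd

def add_review_type(df, min_positive=4):
    """Return a copy of df with a review_type column derived from score."""
    out = df.copy()
    out["review_type"] = np.where(out["score"] >= min_positive, "positive", "negative")
    return out

# Made-up ratings for illustration
reviews = pd.DataFrame({"score": [5, 4, 3, 2, 1]})
labeled = add_review_type(reviews)
print(labeled)
```

Passing min_positive=3 would move the 3-star reviews into the positive group, which makes it easy to check how sensitive the split is to this choice.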
Now add this info to our summary table, scores:
Using the pandas groupby function, we can compute the overall distribution of positive vs. negative reviews:
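The aggregation could look like the following sketch. The counts are made up and the column names count and percent are assumptions, so the output will not match the figures reported for the real data.

```python
import pandas as pd

# Toy summary table: one row per rating, counts are made up
scores = pd.DataFrame({
    "score": [5, 4, 3, 2, 1],
    "count": [60, 15, 5, 5, 15],
    "review_type": ["positive", "positive", "negative", "negative", "negative"],
})

# Total reviews per review type, plus the share of each type in percent
by_type = scores.groupby("review_type")["count"].sum().to_frame("count")
by_type["percent"] = (by_type["count"] / by_type["count"].sum() * 100).round(2)
print(by_type)
```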
According to our analysis, Agoda's app has 54927 positive and 19379 negative reviews, which correspond to 73.92% and 26.08%, respectively. The percentage of negative reviews seems high and requires further analysis, which we will do in the next section. Keep in mind that we have counted "rating 3" reviews as negative. They make up ~2.7% of all reviews, and one can argue that this group belongs with the positive reviews. Even then, the negative review percentage is 23.38%. Before moving on, let us visualize the review distribution of the 10 languages with the most reviews.
Using the pandas value_counts method, we will extract the 10 languages with the most reviews:
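A minimal sketch of this step, with made-up language codes standing in for the language column produced in Part I:

```python
import pandas as pd

# Made-up language codes; Part I produced one detected language per review
df = pd.DataFrame({"language": ["en"] * 6 + ["th"] * 3 + ["ar"] * 2 + ["ko"]})

# value_counts sorts in descending order, so head(10) keeps the 10 largest groups
top_languages = df["language"].value_counts().head(10)
print(top_languages)
```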
Let's now plot this data using the pandas plot function. Be aware that the number of English reviews is two orders of magnitude larger than that of the other languages; therefore, we will use a logarithmic scale for plotting:
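The sketch below shows one possible way to draw such a chart. The counts are made up, plot_counts is a hypothetical helper, and matplotlib is imported inside it so that the data preparation runs even where plotting is unavailable.

```python
import math
import pandas as pd

# Made-up counts with a gap similar in spirit to the one described above
top_languages = pd.Series({"en": 50000, "th": 600, "ar": 400, "ko": 250})

# Orders of magnitude between the largest and the second-largest group
gap = math.log10(top_languages.iloc[0] / top_languages.iloc[1])

def plot_counts(counts, path="top_languages.png"):
    """Bar chart with a logarithmic y-axis; needs matplotlib installed."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    ax = counts.plot(kind="bar", logy=True, title="Reviews per language")
    ax.set_xlabel("language")
    ax.set_ylabel("number of reviews (log scale)")
    ax.figure.savefig(path, bbox_inches="tight")
    return ax
```

Calling plot_counts(top_languages) would write the chart to the given path. Without logy=True, the bars of the smaller languages would be barely visible next to the English one.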
Let's add positive and negative review type information for the Top 10 languages.
Note that in order to keep the integrity of the data, we also used the language column when sorting the dataframe merged.
The table above provides us with:
the total number of reviews per language
the number of negative and positive reviews per language
the percentage of negative and positive reviews per language
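A table with these three pieces of information could be built as sketched below. Note that this uses pd.crosstab rather than the merge-based approach mentioned above, and that the data and the derived column names are made up for illustration.

```python
import pandas as pd

# Made-up reviews with a language and a review_type per row
df = pd.DataFrame({
    "language": ["en", "en", "en", "en", "th", "th", "ar", "ar"],
    "review_type": ["positive", "positive", "positive", "negative",
                    "positive", "negative", "negative", "negative"],
})

# Counts per language and review type, one column per type
table = pd.crosstab(df["language"], df["review_type"])
table["total"] = table["negative"] + table["positive"]

# Share of each review type within a language, in percent
table["negative_pct"] = (table["negative"] / table["total"] * 100).round(2)
table["positive_pct"] = (table["positive"] / table["total"] * 100).round(2)

# Sort by total, using language as a tie-breaker so the order is reproducible
table = table.reset_index().sort_values(["total", "language"], ascending=[False, True])
print(table)
```

Including language in the sort keys keeps rows with equal totals in a stable order, which is the same concern the note about sorting addresses.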
Adding percentage information to our dataset enabled us to see that, although their total number of reviews is not high, Arabic-speaking customers have the highest percentage of negative reviews. The other two languages with a high negative review percentage are English and Thai. Given the large number of English reviews, it is no surprise that English-speaking customers have the highest review counts among all languages!
It is good practice to plot the data when our table has more than 10 rows; however, keep in mind that this number is completely arbitrary!
Now, it is time to visually inspect what customers are talking about in their reviews.
Word clouds are a great way to visualize the most frequently occurring words in customer reviews. To do this, we will use the WordCloud library. Make sure to install the required package first, for example by running pip install wordcloud.
Prior to visualizing the most common words, we will
first combine all reviews into one single text string
then remove the stopwords (check the Stopwords section for more detail) using STOPWORDS provided by the wordcloud library.
Keep in mind that we can use stopword sets provided by other NLP packages such as NLTK, or combine multiple lists and create our own stopword list.
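The two steps above can be sketched as follows. The reviews are made up, build_cloud and its extra_stopwords parameter are hypothetical names, and the image size and background color are arbitrary choices rather than the settings used for the figures in this article.

```python
import pandas as pd

# Made-up reviews; None mimics the NULL values in the content column
df = pd.DataFrame({"content": ["Easy to use app", None, "The app is fast", "Refund took too long"]})

# Step 1: combine all non-null reviews into one single text string
text = " ".join(df["content"].dropna().astype(str))

def build_cloud(text, extra_stopwords=()):
    """Step 2: build the cloud while dropping stopwords (needs the wordcloud package)."""
    from wordcloud import WordCloud, STOPWORDS
    stopwords = set(STOPWORDS) | set(extra_stopwords)
    return WordCloud(stopwords=stopwords, background_color="white",
                     width=800, height=400).generate(text)
```

Calling build_cloud(text).to_file("cloud.png") would write the image to disk. The extra_stopwords argument is where a custom or NLTK-based list could be merged in.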
Here is our wordcloud:
We can also check the top 10 most frequent words with the words_ attribute of our cloud, which is a dictionary. Here is the list of the top 10 most frequent words:
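Since words_ maps each word to a relative frequency, the top entries can be pulled out with a small helper such as the one below. top_words is a hypothetical name, and the example dictionary is made up rather than taken from the real cloud.

```python
def top_words(frequencies, n=10):
    """Return the n (word, weight) pairs with the largest weights."""
    return sorted(frequencies.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up weights shaped like WordCloud.words_ (word -> relative frequency)
example = {"app": 1.0, "hotel": 0.62, "booking": 0.48, "easy": 0.35}
print(top_words(example, n=3))
```

With a real cloud, top_words(cloud.words_) would return the ten highest-weighted words.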
According to our list, "app" is the most frequent word, which is not surprising given that the reviews were extracted from the Google Play Store, i.e. they are reviews of Agoda's app.
It is better to visualize negative and positive reviews separately to get a better understanding of them. In the following code, we generate a separate word cloud for each review type.
Now we have a better understanding of what people are having problems with, e.g. rooms, refund, customer service, and of what they like the most about Agoda's app, e.g. easy to use, fast, and best prices.
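For reference, splitting the reviews by type before building the clouds could look like the sketch below. The reviews are made up, clouds_by_type is a hypothetical helper, and the column names content and review_type follow the ones used earlier in this article.

```python
import pandas as pd

# Made-up labeled reviews
df = pd.DataFrame({
    "content": ["Easy to use", "No refund yet", "Best prices", None],
    "review_type": ["positive", "negative", "positive", "negative"],
})

# One combined text per review type, skipping missing reviews
texts = (
    df.dropna(subset=["content"])
      .groupby("review_type")["content"]
      .apply(" ".join)
      .to_dict()
)

def clouds_by_type(texts):
    """Build one WordCloud per review type (needs the wordcloud package)."""
    from wordcloud import WordCloud, STOPWORDS
    return {label: WordCloud(stopwords=STOPWORDS).generate(text)
            for label, text in texts.items()}
```

clouds_by_type(texts) would return a dictionary with one cloud under "positive" and one under "negative", which can then be plotted side by side.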
In this exercise, we performed sentiment analysis and created word cloud visualizations for customer reviews using Python. We used customers' ratings to determine the sentiment of each review, categorizing them as positive or negative. Additionally, we used the WordCloud library to generate a visual representation of the most frequently occurring words in the reviews, excluding common stop words.
The sentiment analysis provides a quick overview of the overall tone of customer feedback, helping to understand the general sentiment of the reviews. On the other hand, the word cloud offers a visual snapshot of the most prominent words used by customers, highlighting key themes and topics.
By combining sentiment analysis and word cloud visualization, businesses can gain valuable insights into customer sentiments and identify recurring themes in their reviews. These insights can inform decision-making processes, product improvements, and customer satisfaction strategies.