Agoda Reviews - Part I - Scraping Reviews, Detecting Languages, Preprocessing

Analyzing Agoda.com's Google Play Reviews with Python

Before we dive into the tutorial, let me briefly introduce Agoda.com. Agoda is a well-known online travel booking platform and part of the Booking Holdings Inc. family (which also owns Booking.com, Priceline.com, Kayak.com, and more). It allows users to search, book, and review accommodations, flights, and other travel-related services. With a global presence and a wide range of offerings, Agoda attracts millions of users and has accumulated a large body of user reviews on the Google Play Store. Analyzing these reviews can offer insight into customer sentiment, satisfaction, and areas for potential improvement.

In today's data-driven world, understanding customer feedback is crucial for businesses, and web scraping is a powerful tool for gathering insights from public reviews. In this tutorial, we'll use the google_play_scraper package to extract Agoda's reviews from the Google Play Store, and combine it with popular data manipulation and visualization libraries such as pandas, NumPy, Matplotlib, and WordCloud to dissect and visualize the sentiment and content of Agoda.com's app reviews.

I will be using Jupyter Notebook for this exercise, but feel free to use your IDE of choice or run your Python code from a terminal or command window.

Prerequisites

To follow this tutorial, you'll need the following prerequisites:

  1. Python: Ensure you have Python installed on your system. If not, you can download it from Python's official website.

  2. Python Libraries: Make sure to have the following libraries installed:

    • google_play_scraper: For scraping Google Play Store reviews.

    • pandas: For data manipulation and analysis.

    • numpy: For numerical operations.

    • matplotlib: For data visualization.

    • wordcloud: For creating word clouds from text data.

    • emoji: For removing emojis.

    • langdetect: For language detection.

    • iso639: For converting ISO 639-1 language codes to language names.

You can install these libraries using pip (copy and paste the following line into your terminal/command window):

pip install google-play-scraper pandas numpy matplotlib wordcloud emoji langdetect iso639 
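Once the installation finishes, it can be worth confirming that every package is importable before starting. The following is a minimal sketch using only the standard library; the module names are the import names used later in this tutorial, and the helper name missing_modules is an illustrative assumption.

```python
from importlib.util import find_spec

# Import names as used in the import statements of this tutorial
REQUIRED_MODULES = [
    "google_play_scraper", "pandas", "numpy", "matplotlib",
    "wordcloud", "emoji", "langdetect", "iso639",
]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

Note that the pip package name and the import name can differ, as with google-play-scraper and google_play_scraper.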

Step-by-Step Guide

Now, let's dive into the steps of our tutorial. We'll go through the process of scraping Agoda.com's Google Play Store reviews and performing basic data analysis. Here's a brief overview of the steps:

Part I - Scraping Reviews, Detecting Languages, Preprocessing

  1. Importing Required Packages: We'll start by importing the necessary Python packages.

  2. Scraping Agoda.com's Google Play Reviews: We'll use the google_play_scraper package to fetch reviews from the Google Play Store.

  3. Data Preprocessing: We'll clean and preprocess the data to make it suitable for analysis.

Part II - Review Analysis, Sentiment Analysis, Visualization

  1. Exploratory Data Analysis (EDA): We'll conduct an initial exploration of the data to gain insights and identify trends.

  2. Sentiment Analysis: We'll perform sentiment analysis to understand customer sentiment and satisfaction.

  3. Data Visualization: Utilizing matplotlib and wordcloud, we'll create visualizations to represent the findings.

  4. Conclusion and Next Steps: We'll wrap up the tutorial by summarizing our findings and suggesting potential next steps.

Get ready to embark on this exciting journey of data analysis and gain the skills to extract meaningful insights from user reviews on the Google Play Store. Let's begin!

1) Importing Required Packages

from google_play_scraper import reviews_all 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from wordcloud import WordCloud, STOPWORDS
import emoji
from langdetect import detect 
from iso639 import languages

2) Scraping Agoda.com's Google Play Reviews

You can find an app's ID in the URL of its Google Play Store page, where it appears as the value of the id parameter.
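As a small illustration, the ID can be pulled out of a store URL with the standard library. This is only a sketch: the helper name app_id_from_url is hypothetical, and the sample URL (including its hl=en parameter) is an assumed example of what a listing URL looks like.

```python
from urllib.parse import urlparse, parse_qs

def app_id_from_url(url):
    """Return the value of the 'id' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    ids = query.get("id", [])
    return ids[0] if ids else None

# Assumed example of a Google Play listing URL
url = "https://play.google.com/store/apps/details?id=com.agoda.mobile.consumer&hl=en"
print(app_id_from_url(url))  # com.agoda.mobile.consumer
```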

We will be using google_play_scraper's reviews_all function with default settings to scrape all the reviews.

# App ID for the app
app_id = "com.agoda.mobile.consumer"

# Scrape the reviews
reviews = reviews_all(app_id)

# Put the reviews into a Pandas DataFrame
reviews_df = pd.DataFrame(reviews)

# Print summary of a DataFrame
reviews_df.info()

The dataframe has 74853 rows and 11 columns. Let's take a look at the first five rows:

# Print the first 5 reviews
reviews_df.head(5)

3) Data Preprocessing

You might have heard the phrase "Garbage in, garbage out". It is most often used for machine learning algorithms, but it holds true for any type of data processing. As data scientists and analysts, we should make sure we process 'correct' data so that we can draw accurate conclusions, and we do this in the preprocessing step. The preprocessing stage is therefore one of the most important, and most time-consuming, parts of any data analysis.

We will start by removing the columns that we will not need for this project; then we will clean our data by removing emojis from the review contents.

a) Remove Extra Columns

Removing columns that we are not going to use saves time and resources, which is particularly important for large datasets.

# copy dataframe 
reviews_df_copied = reviews_df.copy(deep=True)

# columns to remove
columns_remove = ['userImage','reviewCreatedVersion']

# remove columns (the columns= keyword already implies axis=1)
reviews_df_copied = reviews_df_copied.drop(columns=columns_remove)

# display row, col counts
print(reviews_df.shape) 
print(reviews_df_copied.shape)

# (74853, 11)
# (74853, 9)

b) Removing Emojis

Some of the reviews contain emojis, none of which will be used in our analysis, so they will be removed. For example:

reviews_df.loc[55,'content']

To remove emojis, we will create a function using the emoji library and apply it to the content of every row with pandas' apply method:

# function to remove emojis
def remove_emojis(s):
    try:
        text_no_emoji = emoji.replace_emoji(s)
    except TypeError:
        # non-string content (e.g. None) becomes an empty string
        text_no_emoji = ''
    return text_no_emoji

We can perform the emoji removal on the 'content' column itself, or create a new column and keep the original untouched for comparison. We will do the latter:

# apply the function
reviews_df_copied['content_no_emojis'] = reviews_df_copied['content'].apply(remove_emojis)

# display content
reviews_df_copied.loc[55, ['content','content_no_emojis']].to_frame().T

As we can see in the output, the emoji has been removed from the review at index 55.

c) Detect Languages

Since not all reviews are in English, we will use the langdetect library to detect the language of each review. As with the emoji removal, we will create a function that calls langdetect's detect function on a given text, and apply it to every review in the dataset:

# function to detect language using langdetect
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        # detection fails on empty or non-text input
        return "Unknown"

# apply the language detection function to each review
reviews_df_copied['language_code'] = reviews_df_copied['content_no_emojis'].str.lower().apply(detect_language)

# display the result
reviews_df_copied.head()

The table now includes the ISO 639-1 language codes in the language_code column.

Note that the languages detected for the reviews with index numbers 1 and 4 are not English, even though the reviews are in English. This is because the language detection algorithm is non-deterministic, and we may get different results each time we run it on a text that is too short or ambiguous.

d) Extracting Language Names

We will address this issue after the next step. First, let's convert the ISO 639-1 codes to language names with Python's iso639 package.

# function to look up the language name for an ISO 639-1 code using iso639.languages
def detect_language_name(text):
    try:
        return languages.get(part1=text).name
    except Exception:
        # codes that iso639 does not recognize
        return "Unknown"


# apply the language name detection function to each review
reviews_df_copied['language'] = reviews_df_copied['language_code'].apply(detect_language_name)

# display results
reviews_df_copied.head()

We can see the language name for each review in the language column at the far right. Now it is time to address the incorrect language detection.

e) Language Detection Fix

First, let's check the number of languages and the reviews per language:

# number of languages
print(f"Number of languages: {reviews_df_copied.language.nunique()}")

# check review counts based on language
reviews_df_copied['language'].value_counts().reset_index()

#Number of languages: 49

There are 49 languages in total (the first 20 rows are shown above), and it seems intuitively unlikely that there would be that many reviews in Somali or Afrikaans. Let's check 50 reviews for each of these languages (the first 50 or, alternatively, the last 50):

# review samples for Somali
reviews_df_copied[reviews_df_copied['language']=='Somali']['content_no_emojis'][:50]
# reviews_df_copied[reviews_df_copied['language']=='Somali']['content_no_emojis'][-50:]

# review samples for Afrikaans
reviews_df_copied[reviews_df_copied['language']=='Afrikaans']['content_no_emojis'][:50]
# reviews_df_copied[reviews_df_copied['language']=='Afrikaans']['content_no_emojis'][-50:]

Note that for both languages the reviews are almost entirely in English, with very few exceptions, as we expected. Also note that the misdetected reviews are mostly one- or two-word English phrases. We can use this information and dig a little deeper into the reviews to see which other terms have been misdetected. The following list shows some of the terms I found during my analysis:

english_terms = ['amazing','advertising','advertisement','average','awful','awesome','awesomeness','a+',
                 'best','better','brilliant','bravo',
                 'comfortable','company','convenience','convenient','coupon', 'customer service',
                 'deal', 'discount',
                 'easy to','easy booking','efficient','excellent','exceptional','expensive','experience',
                 'fake','false','fantastic','fast','friendly','fraud','fraudulent',
                 'good','great',
                 'happy','hassle','helpful','horrible','hidden cost',
                 'like','liked','love','loved','luv it',
                 'marvelous',
                 'nice','nice hotel','not bad','no comment','not working',
                 'okay','okey','ok','outstanding',
                 'perfect','poor','process',
                 'quality','quick',
                 'recommended','reliable','response',
                 'satisfied','scam','scammer','seamless','simple','slow','smart','so far','star','stunning','super',
                 'thank','thx','to use','transaction',
                 'useful','useless',
                 'value','very bad','very nice', 'very well',
                 'wow','wonderful','worse','worst','works',
                 'yeah','yes']

Time to set the language to English for reviews that contain the terms listed above. Before doing that, let's copy the dataframe, since we will be making a major change to the dataset:

# copy the dataframe
reviews_df_cp = reviews_df_copied.copy(deep=True)

# create a filter for 'non-english' reviews
not_english = reviews_df_cp['language_code']!='en'

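# update reviews' languages
# regex=False matches each term literally; by default str.contains would read 'a+' as a regular expression
content_lower = reviews_df_cp['content_no_emojis'].str.lower()
for t in english_terms:
    reviews_df_cp.loc[content_lower.str.contains(t, regex=False) & not_english, 'language_code'] = 'en'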

# apply the language name detection function to each review
reviews_df_cp['language'] = reviews_df_cp['language_code'].apply(detect_language_name)

# display first 25 languages along with their review counts
reviews_df_cp['language'].value_counts().reset_index()[:25]

Now the results seem more reasonable! The number of English reviews went up to 72655 after our update. Keep in mind that, because Agoda originated in and mostly serves the Asian market, the same process can also be applied to regional languages such as Thai or Indonesian. Although we still see mismatches in the dataset because of abbreviations or misspellings in the reviews, the results are good enough for us to proceed to the next step of our analysis. Note, however, that 'Unknown' has the second-highest review count. Let's take a closer look.
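A side note on the term matching used in this step: pandas' str.contains treats its pattern as a regular expression unless regex=False is passed, so a term such as 'a+' can match any text containing the letter a, and plain substring matching lets 'ok' match inside longer words. A stricter alternative is to escape each term and require whole-word matches. The following is a sketch with the standard re module on plain strings; the helper name contains_term is hypothetical and is not part of the original analysis.

```python
import re

def contains_term(text, terms):
    """Return True if any term occurs in text as a whole word or phrase."""
    escaped = "|".join(re.escape(t) for t in terms)
    # Lookarounds require that the match is not glued to other word characters
    pattern = r"(?<!\w)(?:" + escaped + r")(?!\w)"
    return re.search(pattern, text.lower()) is not None

print(contains_term("OK thanks", ["ok"]))         # True
print(contains_term("facebook login", ["ok"]))    # False
print(contains_term("Rated A+ overall", ["a+"]))  # True
print(contains_term("gracias", ["a+"]))           # False
```

Stricter matching would relabel fewer reviews as English, so the counts reported in this section would likely change if it were used.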

f) Unknown Language Fix

Now we will figure out why langdetect couldn't detect the language of certain reviews. Below, we group by the 'language_code' and 'language' columns to increase granularity, and use method chaining to make the code easier to read:

# calculate the reviews per language_code and language
language_counts = (reviews_df_cp
 .groupby(['language_code','language'])
 .agg('count')['reviewId']
 .reset_index()
 .sort_values(by='reviewId',ascending=False)
 .reset_index(drop=True)
)

language_counts[language_counts['language']=='Unknown']

Out of the 554 reviews with an 'Unknown' language, 547 truly went undetected. The remainder have the code 'zh-cn' (Chinese), for which the iso639 package couldn't extract a language name. Let's fix the latter first:

# update language names for Chinese
reviews_df_cp.loc[(reviews_df_cp['language_code']=='zh-cn'),'language']='Chinese'
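The lookup fails here because 'zh-cn' carries a region suffix and is therefore not a plain two-letter ISO 639-1 code; langdetect can also return 'zh-tw'. A more general approach is to strip the suffix before looking up the name. The sketch below uses a tiny hand-written dictionary as a stand-in for the iso639 lookup; both the dictionary and the helper name language_name are illustrative assumptions.

```python
# Hypothetical stand-in for the iso639 lookup, kept tiny for illustration
ISO_639_1_NAMES = {"en": "English", "th": "Thai", "zh": "Chinese"}

def language_name(code):
    """Map a language code such as 'zh-cn' to a name via its base two-letter code."""
    if not isinstance(code, str):
        return "Unknown"
    base = code.lower().split("-")[0]  # 'zh-cn' -> 'zh'
    return ISO_639_1_NAMES.get(base, "Unknown")

print(language_name("zh-cn"))    # Chinese
print(language_name("zh-tw"))    # Chinese
print(language_name("Unknown"))  # Unknown
```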



# re-calculate the reviews per language_code and language
language_counts = (reviews_df_cp
 .groupby(['language_code','language'])
 .agg('count')['reviewId']
 .reset_index()
 .sort_values(by='reviewId',ascending=False)
 .reset_index(drop=True)
)

# display results
language_counts[language_counts['language']=='Unknown']
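The count-per-group chain used above can also be written with size(), which counts rows per group directly and lets us name the count column. This is a sketch on a small made-up dataframe; the toy data and the column name review_count are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for the reviews dataframe
toy = pd.DataFrame({
    "reviewId": ["r1", "r2", "r3", "r4"],
    "language_code": ["en", "en", "Unknown", "zh-cn"],
    "language": ["English", "English", "Unknown", "Unknown"],
})

counts = (
    toy.groupby(["language_code", "language"])
    .size()                              # number of rows per group
    .reset_index(name="review_count")    # turn the result into a dataframe
    .sort_values(by="review_count", ascending=False)
    .reset_index(drop=True)
)

# Rows whose language name is 'Unknown', split by the underlying code
unknown = counts[counts["language"] == "Unknown"]
print(unknown)
```

Grouping by both columns is what separates codes that were never detected from codes that were detected but could not be mapped to a name.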

Now, we can check the source, i.e. review contents, for 'Unknown' languages:

# display the first 10 rows
reviews_df_cp[reviews_df_cp['language']=='Unknown'].head(10)

The first 10 rows show that the main cause of the Unknown language type is emojis: reviews that consisted only of emojis have no text left to detect after emoji removal. A manual inspection of the rows also reveals reviews with

  • punctuation only, such as :) or ^-^,

  • percentages, e.g. 100%,

  • no content, i.e. None.

  • There are also 5 reviewIds

    • ['5dba0a1c-7e94-470f-911a-b2e0f807908b', '92f547a4-d988-4af9-9ce2-82909cbb9284', '36f36e48-53e5-47c0-ae2c-5a32ae8d4711', 'eb45309b-0080-42a9-bad8-fd4efa3506e7', '48638631-4381-4aff-ae7e-55a9e2f4106b']

    for which langdetect was unable to detect languages.
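Most of the cases listed above share one trait: the text contains no letters at all. A simple check for this can flag such reviews before language detection is attempted. This is a minimal sketch; the helper name has_letters is hypothetical.

```python
def has_letters(text):
    """Return True if text is a string containing at least one alphabetic character."""
    return isinstance(text, str) and any(ch.isalpha() for ch in text)

for sample in [":)", "^-^", "100%", "", None, "Great app"]:
    print(repr(sample), has_letters(sample))
```

Only the last sample contains letters; the others would be flagged as undetectable without calling the language detector.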

At this point, we can either drop all the rows listed above or keep the data as is. We will do the former and drop them from our dataset, since the number of undetected reviews is very small.

# drop Unknown language type
data = reviews_df_cp[~(reviews_df_cp['language']=='Unknown')].reset_index(drop=True)

# print results
print("Before: ",reviews_df_cp.shape)
print("After: ",data.shape)
print("Difference: ",reviews_df_cp.shape[0] - data.shape[0])

# Before:  (74853, 12)
# After:  (74306, 12)
# Difference:  547

Time to save the dataset:

data.to_csv('./agoda_reviews_231019_with_lang_unknown_dropped.csv', index=False, escapechar='\\')

With this last step, we conclude Part I of the Agoda Reviews project and can now move on to Part II - Review Analysis, Sentiment Analysis, Visualization.
