Agoda Reviews - Part I - Scraping Reviews, Detecting Languages, and Preprocessing
Before we dive into the tutorial, let me briefly introduce Agoda.com. Agoda is an online travel booking platform and part of the Booking Holdings Inc. family (which also owns Booking.com, Priceline.com, Kayak.com, and more). It allows users to search, book, and review accommodations, flights, and other travel-related services. With a global presence and a vast array of offerings, Agoda attracts millions of users, and its app has accumulated a large body of user reviews on the Google Play Store. Analyzing these reviews can offer valuable insights into customer sentiment, satisfaction, and areas for potential improvement.
In today's data-driven world, understanding customer feedback is crucial for businesses, and web scraping is a powerful tool for gathering insights from public reviews. In this tutorial, we'll use the google_play_scraper package to extract Agoda's reviews from the Google Play Store, together with popular data manipulation and visualization libraries such as Pandas, NumPy, Matplotlib, and WordCloud, to dissect and visualize the sentiment and content of Agoda.com's app reviews.
I will be using Jupyter Notebook for this exercise, but feel free to use your IDE of choice or run your Python code from a terminal or command window.
To follow this tutorial, you'll need the following prerequisites:
Python: Ensure you have Python installed on your system. If not, you can download it from Python's official website.
Python Libraries: Make sure to have the following libraries installed:
google_play_scraper: For scraping Google Play Store reviews.
pandas: For data manipulation and analysis.
numpy: For numerical operations.
matplotlib: For data visualization.
wordcloud: For creating word clouds from text data.
emoji: For removing emojis.
langdetect: For language detection.
iso639: For converting ISO 639-1 language codes to language names.
You can install these libraries with pip from your terminal or command window.
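As a hedged sketch, the snippet below checks which of the listed packages can already be imported and prints a pip command for the rest. It assumes that the names listed above are also valid PyPI distribution names (pip treats underscores and hyphens as equivalent). That is an assumption, especially for iso639, because more than one PyPI project provides a module with that name.

```python
import importlib.util

# Names as listed in the prerequisites. Treating them as PyPI distribution
# names is an assumption (notably for iso639).
REQUIRED = [
    "google_play_scraper",
    "pandas",
    "numpy",
    "matplotlib",
    "wordcloud",
    "emoji",
    "langdetect",
    "iso639",
]


def missing_packages(names):
    """Return the names that cannot currently be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    # Copy the printed line into a terminal or command window.
    print("pip install " + " ".join(missing))
else:
    print("All required packages are already installed.")
```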
Now, let's dive into the steps of our tutorial. We'll go through the process of scraping Agoda.com's Google Play Store reviews and performing basic data analysis. Here's a brief overview of the steps:
Part I - Scraping Reviews, Detecting Languages, Preprocessing
Importing Required Packages: We'll start by importing the necessary Python packages.
Scraping Agoda.com's Google Play Reviews: We'll use the google_play_scraper package to fetch reviews from the Google Play Store.
Data Preprocessing: We'll clean and preprocess the data to make it suitable for analysis.
Part II - Review Analysis, Sentiment Analysis, Visualization
Exploratory Data Analysis (EDA): We'll conduct an initial exploration of the data to gain insights and identify trends.
Sentiment Analysis: We'll perform sentiment analysis to understand customer sentiment and satisfaction.
Data Visualization: Utilizing matplotlib and wordcloud, we'll create visualizations to represent the findings.
Conclusion and Next Steps: We'll wrap up the tutorial by summarizing our findings and suggesting potential next steps.
Get ready to embark on this exciting journey of data analysis and gain the skills to extract meaningful insights from user reviews on the Google Play Store. Let's begin!
You can find an app's ID in the URL of its Google Play Store page: it is the value of the id query parameter.
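To make this concrete, here is a small sketch that pulls the ID out of a Play Store URL using only the standard library. The helper name extract_app_id is made up for illustration, and the example URL (including the app ID com.agoda.mobile.consumer) is an assumption that you should check against the actual store page.

```python
from urllib.parse import parse_qs, urlparse


def extract_app_id(play_store_url):
    """Return the value of the 'id' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(play_store_url).query)
    ids = query.get("id")
    return ids[0] if ids else None


# Example URL; the app ID shown here is assumed, not taken from the tutorial.
url = "https://play.google.com/store/apps/details?id=com.agoda.mobile.consumer&hl=en"
print(extract_app_id(url))
```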
We will use google_play_scraper's reviews_all function with its default settings to scrape all the reviews.
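A minimal sketch of this step could look like the following. The wrapper fetch_all_reviews and its fetcher parameter are hypothetical and only there so that the function can be tried without network access. By default it calls reviews_all with nothing but the app ID, i.e. the library's default settings.

```python
def fetch_all_reviews(app_id, fetcher=None):
    """Fetch every available review for app_id as a list of dicts.

    `fetcher` is a hypothetical hook for testing; when omitted, the real
    google_play_scraper.reviews_all is used with its default settings.
    """
    if fetcher is None:
        # Third-party import, done lazily so the module loads without it.
        from google_play_scraper import reviews_all
        fetcher = reviews_all
    return list(fetcher(app_id))
```

With the real fetcher, the result can be turned into a dataframe with pd.DataFrame(fetch_all_reviews(app_id)), where app_id is the ID taken from the store URL. Scraping every review can take a while for a popular app.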
The dataframe has 74853 rows and 11 columns. Let's take a look at the first five rows:
You might have heard the phrase "Garbage in, garbage out". It is most often used for machine learning algorithms, but it holds true for any type of data processing. As data scientists and analysts, we should make sure that we process 'correct' data so that we can draw accurate conclusions, and we do this in the preprocessing step. The preprocessing stage is therefore one of the most important, and most time-consuming, parts of any data analysis.
We will start by removing the columns that we will not need for this project; then we will clean our data by removing emojis from the review contents.
Removing columns that we are not going to use saves time and resources, which matters particularly for large datasets.
Some of the review contents have emojis, none of which will be used in our analysis, and thus they will be removed:
In order to remove emojis, we will create a function using the emoji library and apply it to every content, i.e. row, individually using pandas' apply
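As an illustration, dropping columns could be sketched as below. Which columns to discard is a hypothetical choice made for this example, and the column names are assumed to be the ones google_play_scraper returns; adjust both to your own dataframe.

```python
import pandas as pd

# Hypothetical selection of columns that the analysis does not need.
COLUMNS_TO_DROP = [
    "userName",
    "userImage",
    "reviewCreatedVersion",
    "replyContent",
    "repliedAt",
]


def drop_unused_columns(df, columns=COLUMNS_TO_DROP):
    """Return a copy of df without the given columns.

    errors="ignore" skips names that are not present in df.
    """
    return df.drop(columns=list(columns), errors="ignore")


# Toy data standing in for the scraped reviews.
sample = pd.DataFrame(
    {
        "reviewId": ["a1", "b2"],
        "userName": ["x", "y"],
        "content": ["good", "ok"],
        "score": [5, 3],
    }
)
print(drop_unused_columns(sample).columns.tolist())
```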
method:
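One way to write such a function is sketched here. It relies on emoji.replace_emoji, which is available in recent releases of the emoji package. The function name remove_emojis and the decision to pass non-string values (such as None) through unchanged are assumptions made for this example.

```python
def remove_emojis(text):
    """Return text with emojis stripped; non-string values are returned as-is."""
    if not isinstance(text, str):
        return text
    import emoji  # third-party; imported lazily

    # Replace every emoji with an empty string, then trim leftover whitespace.
    return emoji.replace_emoji(text, replace="").strip()
```

With a dataframe df, this would be applied as df["content_clean"] = df["content"].apply(remove_emojis), where content_clean is a hypothetical name for the new column.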
We can perform the emoji removal in place on the 'content' column, or create a new column and keep the original untouched so that we can compare the two.
As we can see in the output, the emoji has been removed from the content with index number 55.
Since not all reviews are in English, we will use the langdetect library to detect the language of each review. As we did with the emoji removal, we will create a function that detects the language of a given text, using the langdetect package's detect function, and apply it to every review content in the dataset:
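A hedged sketch of such a function follows. The name detect_language, the "unknown" fallback code, and the detector parameter (a hook that lets the function be tried without langdetect installed) are assumptions of this example. Setting DetectorFactory.seed, as langdetect's README suggests, makes repeated runs return the same answer, although very short texts can still be misclassified.

```python
def detect_language(text, detector=None):
    """Return a language code for text, or "unknown" if detection fails."""
    if not isinstance(text, str) or not text.strip():
        return "unknown"
    if detector is None:
        # Third-party import, done lazily.
        from langdetect import DetectorFactory, detect

        DetectorFactory.seed = 0  # make results reproducible across runs
        detector = detect
    try:
        return detector(text)
    except Exception:
        # langdetect raises when the text has no usable features,
        # e.g. only digits, punctuation, or emojis.
        return "unknown"
```

With a dataframe df, this would be applied as df["language_code"] = df["content"].apply(detect_language).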
Now we have the ISO 639-1 language codes in the table (the 4th column).
Note that the languages detected for the reviews with index numbers 1 and 4 are not English, even though the reviews are in English. This is because the language detection algorithm is non-deterministic, and we may receive different results each time we run it on a text that is too short or ambiguous.
We will address this issue after the next step. Let's first convert those ISO 639-1 codes to language names with Python's iso639 package.
We can see the language name for each review in the language column at the far right. Now it is time to address the incorrect language detection issue.
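Several different PyPI projects install a module named iso639, each with its own API, so the sketch below does not call the package. Instead it uses a tiny hand-written lookup table as a stand-in to show the shape of the conversion. The table, the function name code_to_name, and the 'Unknown' fallback are all assumptions of this example.

```python
# Stand-in for the iso639 package: a small, hand-written subset of ISO 639-1.
ISO_639_1_NAMES = {
    "af": "Afrikaans",
    "en": "English",
    "id": "Indonesian",
    "ja": "Japanese",
    "ko": "Korean",
    "so": "Somali",
    "th": "Thai",
}


def code_to_name(code, table=ISO_639_1_NAMES):
    """Map a language code to its English name, or 'Unknown' if not found."""
    return table.get(code, "Unknown")


print(code_to_name("th"))     # a plain two-letter code is found in the table
print(code_to_name("zh-cn"))  # a regional variant is not, so it falls back
```

With a dataframe df, this would be applied as df["language"] = df["language_code"].apply(code_to_name).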
First, let's check the number of languages and the reviews per language:
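With pandas, this check is typically a nunique and a value_counts call. The sketch below runs on toy data, and the column name language is assumed from the text above.

```python
import pandas as pd

# Toy data standing in for the reviews dataframe.
df = pd.DataFrame(
    {"language": ["English", "English", "Thai", "Somali", "English"]}
)

n_languages = df["language"].nunique()
reviews_per_language = df["language"].value_counts()  # sorted, largest first

print(n_languages)
print(reviews_per_language.head(20))
```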
There are 49 languages in total (the first 20 rows are shown above), and intuitively it is very unlikely that there are that many reviews in Somali or Afrikaans. Let's check 50 reviews (the first or, alternatively, the last 50) for each of these two languages:
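A small helper for this kind of spot check might look like the following. The name peek_language is hypothetical and the data is a toy example, not the actual reviews.

```python
import pandas as pd


def peek_language(df, language, n=50, from_end=False):
    """Return the first (or last) n reviews whose detected language matches."""
    subset = df.loc[df["language"] == language, ["content", "language"]]
    return subset.tail(n) if from_end else subset.head(n)


# Toy data: short English phrases that were assigned to other languages.
toy = pd.DataFrame(
    {
        "content": ["good", "nice", "very good", "ok"],
        "language": ["Somali", "Afrikaans", "Somali", "Somali"],
    }
)
print(peek_language(toy, "Somali", n=2))
```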
Note that in both languages the reviews are almost entirely in English, with very few exceptions as we expected. Also note that the mismatched reviews are mostly single or two-word English phrases. We can use this information and do a little more digging on the reviews to see what other terms have been mis-detected. The following list shows some of those terms I found during my analysis:
Time to relabel the reviews that contain the terms listed above as English. Before doing that, let's copy the dataframe, since we will be making a major change to the dataset:
Now the results seem more reasonable! The number of English reviews went up to 72655 after our update. Keep in mind that, because Agoda mostly serves Asian markets, the same process can also be applied to regional languages like Thai or Indonesian. Although we still see mismatches in the dataset because of abbreviations or misspellings in the reviews, the results are good enough to proceed to the next step in our analysis. However, note that there are reviews with an 'Unknown' language, which has the 2nd highest review count. Let's take a closer look.
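The relabeling could be sketched as follows. The terms in ENGLISH_TERMS are hypothetical placeholders and not the list referred to above. Matching the whole review exactly, after trimming and lowercasing, is also an assumption that fits the observation that the mismatches are mostly one- or two-word phrases.

```python
import pandas as pd

# Hypothetical placeholder terms; substitute the terms found in your own data.
ENGLISH_TERMS = {"good", "nice app", "ok"}


def relabel_as_english(df, terms=ENGLISH_TERMS):
    """Return a copy of df where reviews equal to a known term are set to English."""
    fixed = df.copy()  # leave the original dataframe untouched
    normalized = fixed["content"].fillna("").str.strip().str.lower()
    mask = normalized.isin(terms)
    fixed.loc[mask, "language_code"] = "en"
    fixed.loc[mask, "language"] = "English"
    return fixed


# Toy data with two misdetected English reviews.
toy = pd.DataFrame(
    {
        "content": ["Good", "ดีมาก", None, "nice app "],
        "language_code": ["so", "th", "unknown", "af"],
        "language": ["Somali", "Thai", "Unknown", "Afrikaans"],
    }
)
print(relabel_as_english(toy))
```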
Now we will figure out why langdetect couldn't detect the languages of certain reviews. Below, we use groupby with the 'language_code' and 'language' columns to increase granularity, and we use method chaining to make the code easier to read:
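A chained groupby of this kind might look like the sketch below, shown on toy data. The column name review_count and the "unknown" code used for undetected reviews are assumptions of this example.

```python
import pandas as pd

# Toy data standing in for the reviews dataframe.
df = pd.DataFrame(
    {
        "language_code": ["en", "en", "en", "unknown", "zh-cn", "unknown"],
        "language": ["English", "English", "English", "Unknown", "Unknown", "Unknown"],
    }
)

summary = (
    df.groupby(["language_code", "language"])      # one group per code/name pair
      .size()                                      # number of reviews per group
      .reset_index(name="review_count")            # back to a flat dataframe
      .sort_values("review_count", ascending=False)
      .reset_index(drop=True)
)
print(summary)
```

Grouping by both columns shows which codes sit behind the 'Unknown' label.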
Out of the 554 reviews with an 'Unknown' language, 547 truly could not be detected. In addition, the iso639 package couldn't resolve a name for the language code 'zh-cn', which is Chinese. Let's fix the latter first:
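A minimal sketch of this fix, assuming the column names used so far; the helper name fix_chinese_label is made up for the example.

```python
import pandas as pd


def fix_chinese_label(df):
    """Return a copy of df where the code 'zh-cn' is labeled as Chinese."""
    fixed = df.copy()
    fixed.loc[fixed["language_code"] == "zh-cn", "language"] = "Chinese"
    return fixed


# Toy data standing in for the reviews dataframe.
toy = pd.DataFrame(
    {
        "language_code": ["en", "zh-cn", "unknown"],
        "language": ["English", "Unknown", "Unknown"],
    }
)
print(fix_chinese_label(toy))
```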
Now, we can check the source, i.e. review contents, for 'Unknown' languages:
The output of the first 10 rows shows us that the cause of the 'Unknown' language type is emojis! A manual inspection of the rows also reveals reviews with
punctuation, such as :) or ^-^,
percentages, e.g. 100%,
no content, i.e. None.
There are also 5 reviewIds for which langdetect was unable to detect languages:
['5dba0a1c-7e94-470f-911a-b2e0f807908b', '92f547a4-d988-4af9-9ce2-82909cbb9284', '36f36e48-53e5-47c0-ae2c-5a32ae8d4711', 'eb45309b-0080-42a9-bad8-fd4efa3506e7', '48638631-4381-4aff-ae7e-55a9e2f4106b']
At this point, we can either drop all the rows that contain the characters above or keep the data as is. We will do the former and drop them from our dataset, since the number of undetected reviews is very small.
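Dropping those rows can be sketched as a simple filter on the language column. This assumes that every undetected review carries the label 'Unknown'; the helper name drop_undetected is hypothetical.

```python
import pandas as pd


def drop_undetected(df):
    """Return df without the rows whose language is 'Unknown'."""
    return df[df["language"] != "Unknown"].reset_index(drop=True)


# Toy data standing in for the reviews dataframe.
toy = pd.DataFrame(
    {
        "content": ["good", ":)", "100%", "ดีมาก"],
        "language": ["English", "Unknown", "Unknown", "Thai"],
    }
)
print(drop_undetected(toy))
```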
Time to save the dataset:
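For example, the dataframe can be written to a CSV file with pandas' to_csv. The helper name save_reviews and the file name mentioned below are hypothetical; pick whatever path suits your project.

```python
import pandas as pd


def save_reviews(df, path):
    """Write df to a CSV file without the index column."""
    df.to_csv(path, index=False, encoding="utf-8")


# Toy data standing in for the cleaned reviews dataframe.
sample = pd.DataFrame(
    {
        "content": ["good", "ดีมาก"],
        "language_code": ["en", "th"],
        "language": ["English", "Thai"],
    }
)
```

Calling save_reviews(sample, "agoda_reviews_part1.csv") would then create the file in the current working directory, ready to be loaded in Part II with pd.read_csv.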
With this last step we conclude Part I of the Agoda Reviews project and can now move on to Part II - Review Analysis, Sentiment Analysis, Visualization.