Web Scraping and Mapping
Creating a Map of Major Ports in China using Python and Mapbox's Map
Sedar Sahin, May 2023
Introduction
In this tutorial, we will learn how to create an interactive map showcasing the major ports in China. Port locations are especially important if you ship goods to or from China. So if you work, or will work, as a data scientist or analyst at a logistics, trading, or manufacturing company, it is good to know where these ports are. The skills you learn here will help you visualize for stakeholders not only port locations but any kind of establishment that has latitude and longitude information, and they are a first step toward geospatial analysis.
We will use Python's libraries like Requests, BeautifulSoup, Pandas, and Folium to gather the data, manipulate it, and visualize it on the map. Initially, we'll encounter a small challenge with the map's language, but we'll overcome it by utilizing Mapbox's English base map.
Prerequisites:
Basic knowledge of Python
Familiarity with web scraping using BeautifulSoup
Understanding of data manipulation with Pandas
Basic awareness of Folium library for map visualization
Step 1: Installing Required Libraries
First things first, ensure you have Python installed on your system. Then, install the necessary libraries using pip if you haven't already:
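The install command did not survive on this page. As a rough sketch, the snippet below checks which of the four libraries can be imported and prints a matching pip command. The PyPI names used here (requests, beautifulsoup4, pandas, folium) are the usual ones, and the helper name missing_packages is only illustrative.

```python
import importlib.util

# Import name -> PyPI package name (BeautifulSoup is imported as bs4).
REQUIRED = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "pandas": "pandas",
    "folium": "folium",
}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of required libraries that cannot be imported."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = missing_packages()
if missing:
    print("Run: pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```

If anything is reported missing, run the printed command in a terminal before continuing.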
Step 2: Web Scraping to Get Major Ports Data
Once we have the necessary packages installed, we can start working in our favorite Python interface: any IDE of your choice or a Jupyter Notebook.
Because we don't have a ready-made dataset of port names with their latitude and longitude, we need to create one. We will fetch the list of major ports in China from the Wikipedia page (see below) using Python's requests and BeautifulSoup libraries. To do this, we'll write a Python script to scrape the data.
Here is the list of Ports in the tabular form:
No.  Port
1    Dalian
2    Yingkou
3    Jinzhou
4    Qinhuangdao
5    Tianjin
6    Yantai
7    Weihai
8    Qingdao
9    Rizhao
10   Lianyungang
11   Nantong
12   Zhenjiang
13   Jiangyin
14   Nanjing
15   Shanghai
16   Ningbo
17   Zhoushan
18   Jiujiang
19   Taizhou (North of Wenzhou)
20   Wenzhou
21   Taizhou (South of Wenzhou)
22   Changle
23   Quanzhou
24   Xiamen
25   Shantou
26   Jieyang
27   Guangzhou
28   Zhuhai
29   Shenzhen
30   Zhanjiang
31   Beihai
32   Fangchenggang
33   Haikou
34   Basuo
To retrieve the coordinates, we need to click on each port on the page. The screenshot below shows the details for the Dalian port. Its coordinates appear in two places on the page, both highlighted with red rectangles:
Wikipedia structures its URLs as:
"https://en.wikipedia.org/wiki/" + <Port Name>
Using this information, we will create the URLs for each port's page in order to extract its coordinates. We can use apps like Excel, Numbers, Google Sheets, or simply a text editor to do this.
The following table shows the ports with their corresponding Wikipedia pages:
No.  Wikipedia page name
1    Dalian
2    Yingkou
3    Jinzhou
4    Qinhuangdao
5    Tianjin
6    Yantai
7    Weihai
8    Qingdao
9    Rizhao
10   Lianyungang
11   Nantong
12   Zhenjiang
13   Jiangyin
14   Nanjing
15   Shanghai
16   Ningbo
17   Zhoushan
18   Jiujiang
19   Taizhou,_Zhejiang
20   Wenzhou
21   Taizhou (South of Wenzhou)   NA
22   Changle
23   Quanzhou
24   Xiamen
25   Shantou
26   Jieyang
27   Guangzhou
28   Zhuhai
29   Shenzhen
30   Zhanjiang
31   Beihai
32   Fangchenggang
33   Haikou
34   Basuo
Please note that port #19 was replaced by "Taizhou,_Zhejiang" in order to retrieve its coordinates, and #21 was skipped because Wikipedia has no geographical coordinate data for a port with this name.
Keep in mind that extracting the URLs can be automated via scraping as well. For the sake of this tutorial, I wanted to use scraping for only one job: extracting the coordinates of the ports.
Now that we have the links to the ports, it is time to retrieve their coordinates. To do that, we first "inspect" the HTML (no need to worry about it for this tutorial) to see which element or attribute holds the latitude and longitude information.
The image below shows that the latitude and longitude are stored in elements whose HTML class attribute is latitude and longitude, respectively.
Now that we have all the information (the port names, their URLs, and how to access their coordinates), we can roll up our sleeves and write the Python script that combines them.
Since we will be scraping multiple pages (33 to be exact), it is good to create a function to do the heavy work:
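The original function is not reproduced on this page, so the following is only a minimal sketch of what such a helper might look like. It assumes the coordinates sit in elements with the classes latitude and longitude, as described above. The function names parse_coordinates and get_coordinates, the timeout, and the User-Agent string are illustrative choices, not the author's.

```python
import requests
from bs4 import BeautifulSoup


def parse_coordinates(html):
    """Return (latitude, longitude) text found in the HTML, or (None, None)."""
    soup = BeautifulSoup(html, "html.parser")
    # Take the first element carrying each class; a page may repeat them.
    lat = soup.find(class_="latitude")
    lon = soup.find(class_="longitude")
    if lat is None or lon is None:
        return None, None
    return lat.get_text(strip=True), lon.get_text(strip=True)


def get_coordinates(url, timeout=10):
    """Download one page and extract its first coordinate pair."""
    # A descriptive User-Agent is polite; this value is a placeholder.
    headers = {"User-Agent": "china-ports-tutorial/0.1 (example script)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return parse_coordinates(response.text)
```

Splitting parsing from downloading keeps the parsing step easy to try on a saved HTML snippet without touching the network.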
Next, we need to define a list that stores all the port URLs we extracted earlier. This is simply a copy and paste step.
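As an alternative to pasting full URLs, the list can be assembled from the page names in the table above. This is a sketch; the names BASE_URL, PAGE_NAMES, and port_urls are assumptions made for illustration.

```python
BASE_URL = "https://en.wikipedia.org/wiki/"

# Page names from the table above. Port #21 is left out because
# no page with coordinates was found for it.
PAGE_NAMES = [
    "Dalian", "Yingkou", "Jinzhou", "Qinhuangdao", "Tianjin", "Yantai",
    "Weihai", "Qingdao", "Rizhao", "Lianyungang", "Nantong", "Zhenjiang",
    "Jiangyin", "Nanjing", "Shanghai", "Ningbo", "Zhoushan", "Jiujiang",
    "Taizhou,_Zhejiang", "Wenzhou", "Changle", "Quanzhou", "Xiamen",
    "Shantou", "Jieyang", "Guangzhou", "Zhuhai", "Shenzhen", "Zhanjiang",
    "Beihai", "Fangchenggang", "Haikou", "Basuo",
]

# Apply the URL pattern: base address + page name.
port_urls = [BASE_URL + name for name in PAGE_NAMES]

print(len(port_urls), "URLs, starting with", port_urls[0])
```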
Time to run our function to extract the coordinates from each port's page. We will store them in a dictionary called lat_log_dict.
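One possible shape for this loop is sketched below. The helper collect_coordinates and its fetch parameter are hypothetical. A stand-in fetcher with placeholder values is used so the snippet runs offline; in practice you would pass your real scraping function and the full URL list.

```python
def collect_coordinates(urls, fetch):
    """Map each page name to the (latitude, longitude) returned by fetch(url)."""
    results = {}
    for url in urls:
        port = url.rsplit("/", 1)[-1]  # the page name is the last URL segment
        try:
            results[port] = fetch(url)
        except Exception as exc:  # keep going if a single page fails
            print(f"Skipping {port}: {exc}")
    return results


def fake_fetch(url):
    """Offline stand-in for the scraping function; returns placeholder values."""
    return ("0°0′N", "0°0′E")  # not real coordinates


lat_log_dict = collect_coordinates(
    ["https://en.wikipedia.org/wiki/Dalian"], fake_fetch
)
print(lat_log_dict)
```

When scraping for real, a short pause between requests (for example with time.sleep) is a considerate addition.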
Output:
Step 3: Storing Data with Pandas
With the data extracted, we'll use the Pandas library to organize and store the port details in a structured format, making it easier to work with the data.
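A minimal sketch of this step follows. It assumes lat_log_dict maps each port name to a (latitude, longitude) pair of strings. The two entries are placeholders with made-up values, and the column names are illustrative.

```python
import pandas as pd

# Placeholder entries in the shape the scraper returns;
# these are NOT real port coordinates.
lat_log_dict = {
    "Port A": ("10°30′15″N", "100°45′30″E"),
    "Port B": ("20°15′N", "110°00′E"),
}

# One row per port: the dictionary keys become the "port" column.
ports_df = (
    pd.DataFrame.from_dict(
        lat_log_dict, orient="index", columns=["latitude", "longitude"]
    )
    .rename_axis("port")
    .reset_index()
)
print(ports_df)
```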
Output:
We now have coordinates for all the major ports in China. All the coordinates are in degrees-minutes-seconds format. Python's mapping package, Folium, however, works with decimal degrees. Therefore, before we map the ports, we need to convert the latitude and longitude values to decimal degrees.
Conversion Formula:
decimal degrees = degrees + (minutes / 60) + (seconds/3600)
To do that we will create another function called dms2dd that will convert each coordinate to decimal degrees.
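The author's dms2dd is not shown on this page, so here is one way it might be written. It assumes the scraped strings look like 10°30′15″N (seconds may be absent) and applies the conversion formula above. The sign handling for S and W is an addition for completeness, since all the ports in this tutorial lie in the northern and eastern hemispheres.

```python
import re


def dms2dd(dms):
    """Convert a DMS string such as 10°30′15″N to signed decimal degrees."""
    # Pull out up to three numbers: degrees, minutes, seconds.
    parts = [float(p) for p in re.findall(r"\d+(?:\.\d+)?", dms)]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Cannot parse coordinate: {dms!r}")
    # Missing minutes or seconds count as zero.
    degrees, minutes, seconds = parts + [0.0] * (3 - len(parts))
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by convention.
    if dms.strip().upper().endswith(("S", "W")):
        decimal = -decimal
    return decimal


print(dms2dd("10°30′15″N"))  # hypothetical input, not a real port
```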
We will now apply the function using pandas' apply method to latitude and longitude data:
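The sketch below shows the apply step on a two-row placeholder DataFrame. A compact version of the conversion function is repeated only so the snippet runs on its own. The output column names lat_dd and lon_dd are assumptions, and the coordinate strings are made up.

```python
import re

import pandas as pd


def dms2dd(dms):
    """Compact stand-in for the DMS-to-decimal conversion function."""
    nums = [float(p) for p in re.findall(r"\d+(?:\.\d+)?", dms)]
    d, m, s = (nums + [0.0, 0.0])[:3]  # pad missing minutes/seconds with zero
    value = d + m / 60 + s / 3600
    return -value if dms.strip().upper().endswith(("S", "W")) else value


# Placeholder rows; not real port coordinates.
ports_df = pd.DataFrame(
    {
        "port": ["Port A", "Port B"],
        "latitude": ["10°30′15″N", "20°15′N"],
        "longitude": ["100°45′30″E", "110°00′E"],
    }
)

# Series.apply calls dms2dd once per cell and returns a new column.
ports_df["lat_dd"] = ports_df["latitude"].apply(dms2dd)
ports_df["lon_dd"] = ports_df["longitude"].apply(dms2dd)
print(ports_df[["port", "lat_dd", "lon_dd"]])
```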
Output:
Step 4: Setting Up Folium and OpenStreetMap (OSM)
We'll begin visualizing the port locations using the Folium library with the 'OpenStreetMap (OSM)' base map. It will give us a basic map view but may present location names in Chinese, which can be a bit challenging for non-Chinese speakers.
Output:
Due to HTML limitations, the map pasted above is not interactive. Once you run the code, you will be able to interact with it on your computer as shown below.
As you can see on the map, all names are in the local language, meaning that unless you read Chinese, you have no way of telling where, for instance, the city of Dalian is. Therefore we need the names in English. At the time of this writing (May 2023), OpenStreetMap does not offer English translations for countries where English is not an official language.
Step 5: Creating a Mapbox Account
To overcome the language issue, we'll create an account on Mapbox. Mapbox offers an English base map, which will make our map more user-friendly for non-Chinese speakers.
Step 6: Generating a Mapbox Public Key
Once you have your Mapbox account set up, you can generate a public access token, or use the default one, to access the Mapbox maps from our Python script. For this tutorial, I created an access token called china_ports.
Step 7: Using Mapbox's English Base Map
We'll modify our Python script to use Mapbox's map with the English base layer. This will allow us to display location names in English and improve the user experience. Once run, the map will have interactive features, allowing users to zoom in, zoom out, and click on port markers to get additional information.
Output:
Once again, due to HTML limitations, the map pasted above is not interactive. Once you run the code, you will be able to interact with it on your computer. For the interactive Mapbox map, please click the following button.
Conclusion
Congratulations! We've successfully created an interactive map displaying major ports in China using Python, BeautifulSoup, Pandas, and Folium with Mapbox's English base map. You can now explore and share this map with others, enhancing their understanding of the major ports in China.