Web Scraping and Mapping

Creating a Map of Major Ports in China using Python and Mapbox's Map

Sedar Sahin, May 2023

Introduction

In this tutorial, we will learn how to create an interactive map showcasing the major ports in China. Port locations become especially important if you ship goods to or from China. Therefore, for those of you who work (or will work) as a Data Scientist or Analyst in a logistics, trading, or manufacturing company, it is good to know where these ports are. The skills you learn in this exercise will help you visualize for stakeholders not only the locations of ports but any kind of establishment with latitude and longitude information, and take a step further by applying geospatial analyses.

We will use Python libraries such as Requests, BeautifulSoup, Pandas, and Folium to gather the data, manipulate it, and visualize it on a map. Initially, we'll encounter a small challenge with the map's language, but we'll overcome it by using Mapbox's English base map.

Prerequisites:

  • Basic knowledge of Python

  • Familiarity with web scraping using BeautifulSoup

  • Understanding of data manipulation with Pandas

  • Basic awareness of Folium library for map visualization

Step 1: Installing Required Libraries

First things first, ensure you have Python installed on your system. Then, install the necessary libraries using pip if you haven't already:

pip install requests beautifulsoup4 pandas folium

Step 2: Web Scraping to Get Major Ports Data

Once we have the necessary packages installed, we can start working in our favorite Python interface: any IDE of your choice or a Jupyter Notebook.

Because we don't have access to such data with port names along with their latitude and longitude, we need to create it. We will fetch the list of major ports in China from the Wikipedia page (see below) using Python's requests and BeautifulSoup libraries. To do this, we'll write a Python script to scrape the data.

Here is the list of Ports in the tabular form:

NR   Port
1    Dalian
2    Yingkou
3    Jinzhou
4    Qinhuangdao
5    Tianjin
6    Yantai
7    Weihai
8    Qingdao
9    Rizhao
10   Lianyungang
11   Nantong
12   Zhenjiang
13   Jiangyin
14   Nanjing
15   Shanghai
16   Ningbo
17   Zhoushan
18   Jiujiang
19   Taizhou (North of Wenzhou)
20   Wenzhou
21   Taizhou (South of Wenzhou)
22   Changle
23   Quanzhou
24   Xiamen
25   Shantou
26   Jieyang
27   Guangzhou
28   Zhuhai
29   Shenzhen
30   Zhanjiang
31   Beihai
32   Fangchenggang
33   Haikou
34   Basuo

To retrieve the coordinates, we need to click on each of the ports on the page. The screenshot below shows the details for the Dalian port, with its coordinates appearing in two different places on the page, both highlighted with red rectangles:

Wikipedia structures its URLs as follows:

"https://en.wikipedia.org/wiki/" + <Port Name>

Using this information, we will create URLs to access each port's page and extract its coordinates. We can use apps like Excel, Numbers, or Google Sheets, or simply a text editor, to do this.
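If you would rather build the links in Python than in a spreadsheet, the URL pattern above can be sketched as follows. The helper name build_wiki_url and the three sample names are illustrative assumptions, not part of the original workflow. The helper only replaces spaces with underscores and percent-encodes unusual characters; it does not check that the page exists.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"

def build_wiki_url(page_title):
    """Return the English Wikipedia URL for a page title (hypothetical helper)."""
    # Wikipedia page titles use underscores in place of spaces
    slug = page_title.strip().replace(" ", "_")
    # keep commas and parentheses readable; percent-encode anything unusual
    return WIKI_BASE + quote(slug, safe=",()")

sample_ports = ["Dalian", "Yingkou", "Taizhou, Zhejiang"]
sample_urls = [build_wiki_url(name) for name in sample_ports]
print(sample_urls)
```

Names that are ambiguous on Wikipedia still need a manual check, which is why the page title for port #19 had to be adjusted by hand.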

The following table shows the ports with their corresponding Wikipedia pages:

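NR   Port                         URL Wiki
1    Dalian                       https://en.wikipedia.org/wiki/Dalian
2    Yingkou                      https://en.wikipedia.org/wiki/Yingkou
3    Jinzhou                      https://en.wikipedia.org/wiki/Jinzhou
4    Qinhuangdao                  https://en.wikipedia.org/wiki/Qinhuangdao
5    Tianjin                      https://en.wikipedia.org/wiki/Tianjin
6    Yantai                       https://en.wikipedia.org/wiki/Yantai
7    Weihai                       https://en.wikipedia.org/wiki/Weihai
8    Qingdao                      https://en.wikipedia.org/wiki/Qingdao
9    Rizhao                       https://en.wikipedia.org/wiki/Rizhao
10   Lianyungang                  https://en.wikipedia.org/wiki/Lianyungang
11   Nantong                      https://en.wikipedia.org/wiki/Nantong
12   Zhenjiang                    https://en.wikipedia.org/wiki/Zhenjiang
13   Jiangyin                     https://en.wikipedia.org/wiki/Jiangyin
14   Nanjing                      https://en.wikipedia.org/wiki/Nanjing
15   Shanghai                     https://en.wikipedia.org/wiki/Shanghai
16   Ningbo                       https://en.wikipedia.org/wiki/Ningbo
17   Zhoushan                     https://en.wikipedia.org/wiki/Zhoushan
18   Jiujiang                     https://en.wikipedia.org/wiki/Jiujiang
19   Taizhou (North of Wenzhou)   https://en.wikipedia.org/wiki/Taizhou,_Zhejiang
20   Wenzhou                      https://en.wikipedia.org/wiki/Wenzhou
21   Taizhou (South of Wenzhou)   NA
22   Changle                      https://en.wikipedia.org/wiki/Changle
23   Quanzhou                     https://en.wikipedia.org/wiki/Quanzhou
24   Xiamen                       https://en.wikipedia.org/wiki/Xiamen
25   Shantou                      https://en.wikipedia.org/wiki/Shantou
26   Jieyang                      https://en.wikipedia.org/wiki/Jieyang
27   Guangzhou                    https://en.wikipedia.org/wiki/Guangzhou
28   Zhuhai                       https://en.wikipedia.org/wiki/Zhuhai
29   Shenzhen                     https://en.wikipedia.org/wiki/Shenzhen
30   Zhanjiang                    https://en.wikipedia.org/wiki/Zhanjiang
31   Beihai                       https://en.wikipedia.org/wiki/Beihai
32   Fangchenggang                https://en.wikipedia.org/wiki/Fangchenggang
33   Haikou                       https://en.wikipedia.org/wiki/Haikou
34   Basuo                        https://en.wikipedia.org/wiki/Basuo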

Please note that Port #19 was replaced by "Taizhou,_Zhejiang" in order to retrieve its coordinates, and #21 was skipped because Wikipedia has no geographical coordinate data for a port with this name.

Keep in mind that extracting the URLs can be automated via scraping as well. For the sake of this tutorial, I wanted to use scraping for only one job: extracting the coordinates of the ports.

Now that we have the links for the ports, it is time to retrieve their coordinates. To do that, we first "inspect" the HTML (no need to worry about the details for this tutorial) to see which element and attribute store the latitude and longitude information.

The image below shows that the latitude and longitude are stored in span elements whose class attribute is latitude and longitude, respectively.
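To see what that lookup does without touching the network, here is a minimal sketch run against a hand-written HTML fragment. The fragment is an assumption that only imitates the markup described above (it is not copied from Wikipedia), and the coordinate values are the ones reported for Yingkou later in this tutorial.

```python
from bs4 import BeautifulSoup

# hand-written fragment imitating the markup described above (an assumption)
sample_html = """
<span class="geo-dms">
  <span class="latitude">40°37′30″N</span>
  <span class="longitude">122°13′08″E</span>
</span>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
lat_tag = soup.find("span", {"class": "latitude"})
lon_tag = soup.find("span", {"class": "longitude"})

sample_lat = lat_tag.text.strip()
sample_lon = lon_tag.text.strip()
print(sample_lat, sample_lon)
```

Because find() returns None when a class is missing, it is worth checking the result before reading its text, as a page without coordinates would otherwise raise an AttributeError.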

Now that we have all the information, including the names of the ports and their URLs, and know how to access their coordinates, we can roll up our sleeves and write the Python script that combines them.

# Import Libraries

# Web Scraping
import requests
from bs4 import BeautifulSoup

# Data Manipulation
import pandas as pd

# Regular Expression Operations (to be used in degree conversion)
import re

# Maps
import folium

Since we will be scraping multiple pages (33 to be exact), it is good to create a function to do the heavy lifting:

def get_latlon_from_wikipedia(url):
    
    # access to the url
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Failed to fetch the page.")

    # parse the page
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # retrieve latitude and longitude info
    latitude_span = soup.find('span', {'class': 'latitude'})
    longitude_span = soup.find('span', {'class': 'longitude'})

    if not latitude_span:
        raise Exception("Latitude data not found on the page.")
        
    if not longitude_span:
        raise Exception("Longitude data not found on the page.")
    
    # assign the coordinates to latitude and longitude variables
    latitude = latitude_span.text.strip()
    longitude = longitude_span.text.strip()
    
    # store the latitude and longitude info in a list
    latlon = [latitude, longitude]
    
    return latlon

Next, we need to define a list that stores the URLs of all the ports we collected earlier. This is simply a copy-and-paste step.

url_list=[
'https://en.wikipedia.org/wiki/Dalian',
'https://en.wikipedia.org/wiki/Yingkou',
'https://en.wikipedia.org/wiki/Jinzhou',
'https://en.wikipedia.org/wiki/Qinhuangdao',
'https://en.wikipedia.org/wiki/Tianjin',
'https://en.wikipedia.org/wiki/Yantai',
'https://en.wikipedia.org/wiki/Weihai',
'https://en.wikipedia.org/wiki/Qingdao',
'https://en.wikipedia.org/wiki/Rizhao',
'https://en.wikipedia.org/wiki/Lianyungang',
'https://en.wikipedia.org/wiki/Nantong',
'https://en.wikipedia.org/wiki/Zhenjiang',
'https://en.wikipedia.org/wiki/Jiangyin',
'https://en.wikipedia.org/wiki/Nanjing',
'https://en.wikipedia.org/wiki/Shanghai',
'https://en.wikipedia.org/wiki/Ningbo',
'https://en.wikipedia.org/wiki/Zhoushan',
'https://en.wikipedia.org/wiki/Jiujiang',
'https://en.wikipedia.org/wiki/Taizhou,_Zhejiang',
'https://en.wikipedia.org/wiki/Wenzhou',
'https://en.wikipedia.org/wiki/Taizhou (South of Wenzhou)',
'https://en.wikipedia.org/wiki/Changle',
'https://en.wikipedia.org/wiki/Quanzhou',
'https://en.wikipedia.org/wiki/Xiamen',
'https://en.wikipedia.org/wiki/Shantou',
'https://en.wikipedia.org/wiki/Jieyang',
'https://en.wikipedia.org/wiki/Guangzhou',
'https://en.wikipedia.org/wiki/Zhuhai',
'https://en.wikipedia.org/wiki/Shenzhen',
'https://en.wikipedia.org/wiki/Zhanjiang',
'https://en.wikipedia.org/wiki/Beihai',
'https://en.wikipedia.org/wiki/Fangchenggang',
'https://en.wikipedia.org/wiki/Haikou',
'https://en.wikipedia.org/wiki/Basuo',
]

It is time to run our function to extract the coordinates from each port's page. We will store them in a dictionary called lat_log_dict.

# create a dictionary to store the coordinates
lat_log_dict = {}

# counter
cnt = 0

# loop through the list of urls and extract and store the coordinates

for url in url_list:
    # copy url
    wikipedia_url = url
    
    # extract the name of the port
    port = wikipedia_url.split("/")[-1]
    
    try:
        latlon = get_latlon_from_wikipedia(wikipedia_url)
        # print(f" {cnt} - Lat/Lon for {port}:", latlon) # to display current port and its coordinates
        
        lat_log_dict[port] = latlon
        
    except Exception as e:
        print("Error:", str(e))
    cnt+=1
        
lat_log_dict

Output:

{'Dalian': ['38°54′N', '121°36′E'],
 'Yingkou': ['40°37′30″N', '122°13′08″E'],
 'Jinzhou': ['41°07′44″N', '121°08′53″E'],
 'Qinhuangdao': ['39°53′18″N', '119°31′13″E'],
 'Tianjin': ['39°08′01″N', '117°12′19″E'],
 'Yantai': ['37°27′53″N', '121°26′52″E'],
 'Weihai': ['37°30′48″N', '122°07′14″E'],
 'Qingdao': ['36°04′01″N', '120°22′58″E'],
 'Rizhao': ['35°25′01″N', '119°31′37″E'],
 'Lianyungang': ['34°35′48″N', '119°13′17″E'],
 'Nantong': ['31°58′52″N', '120°53′38″E'],
 'Zhenjiang': ['32°11′17″N', '119°25′26″E'],
 'Jiangyin': ['31°50′20″N', '120°17′42″E'],
 'Nanjing': ['32°03′39″N', '118°46′44″E'],
 'Shanghai': ['31°13′43″N', '121°28′29″E'],
 'Ningbo': ['29°51′37″N', '121°37′28″E'],
 'Zhoushan': ['29°59′08″N', '122°12′27″E'],
 'Jiujiang': ['29°39′40″N', '115°57′14″E'],
 'Taizhou,_Zhejiang': ['28°39′21″N', '121°25′15″E'],
 'Wenzhou': ['27°59′38″N', '120°41′57″E'],
 'Changle': ['25°55′N', '119°33′E'],
 'Quanzhou': ['24°52′28″N', '118°40′33″E'],
 'Xiamen': ['24°28′47″N', '118°05′20″E'],
 'Shantou': ['23°21′14″N', '116°40′55″E'],
 'Jieyang': ['23°33′04″N', '116°22′22″E'],
 'Guangzhou': ['23°07′48″N', '113°15′36″E'],
 'Zhuhai': ['22°16′18″N', '113°34′37″E'],
 'Shenzhen': ['22°32′29″N', '114°03′35″E'],
 'Zhanjiang': ['21°16′12″N', '110°21′27″E'],
 'Beihai': ['21°28′52″N', '109°07′12″E'],
 'Fangchenggang': ['21°41′12″N', '108°21′17″E'],
 'Haikou': ['20°01′07″N', '110°20′56″E'],
 'Basuo': ['19°05′31″N', '108°40′16″E']}
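Note that the list holds 34 URLs while the dictionary has 33 entries: the request for "Taizhou (South of Wenzhou)" fails, and the loop only prints an error message for it. A possible variation, sketched below, records which URLs failed so they can be reviewed afterwards. The names scrape_all and fake_fetch are hypothetical, and fake_fetch merely stands in for get_latlon_from_wikipedia so that the sketch runs without network access.

```python
import time

def scrape_all(urls, fetch, delay_seconds=0.0):
    """Call fetch(url) for every URL; return (results, failures)."""
    results = {}
    failures = {}
    for url in urls:
        # the port name is the last part of the URL
        port = url.split("/")[-1]
        try:
            results[port] = fetch(url)
        except Exception as exc:
            # remember the reason instead of only printing it
            failures[port] = str(exc)
        # optional pause between requests to go easy on the server
        time.sleep(delay_seconds)
    return results, failures

def fake_fetch(url):
    """Stand-in for get_latlon_from_wikipedia (no network access)."""
    if url.endswith("Dalian"):
        return ["38°54′N", "121°36′E"]
    raise Exception("Latitude data not found on the page.")

found, failed = scrape_all(
    ["https://en.wikipedia.org/wiki/Dalian",
     "https://en.wikipedia.org/wiki/Taizhou (South of Wenzhou)"],
    fake_fetch,
)
print(found)
print(failed)
```

With the real function you would pass get_latlon_from_wikipedia as fetch and, if you like, a small delay_seconds value between requests.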

Step 3: Storing Data with Pandas

With the data extracted, we'll use the Pandas library to organize and store the port details in a structured format, making it easier to work with the data.

# create a dataframe
df = pd.DataFrame(data=lat_log_dict)
df = df.T
df.reset_index(inplace=True)

# update the column names
df = df.rename(columns={'index':'Port', 0:'Lat (DMS)',1:'Lon (DMS)'})

# display first 3 rows
df[:3]

Output:

We now have coordinates for the major ports in China. All the coordinates are in degrees-minutes-seconds (DMS) format. Python's mapping package, Folium, however, works with decimal degrees. Therefore, before we map the ports, we need to convert the latitude and longitude values to decimal degrees.

Conversion Formula:

decimal degrees = degrees + (minutes / 60) + (seconds/3600)
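As a quick worked example of the formula, take the Yingkou latitude from the output above, 40°37′30″N: 40 + 37/60 + 30/3600 = 40.625. The same arithmetic in Python (the function name dms_to_decimal is only for illustration and takes already-separated numeric values):

```python
def dms_to_decimal(degrees, minutes, seconds):
    """Apply the conversion formula to numeric degree, minute, second values."""
    return degrees + minutes / 60 + seconds / 3600

yingkou_lat = dms_to_decimal(40, 37, 30)
print(yingkou_lat)
```

Coordinates in the southern or western hemisphere would additionally need a negative sign, which this small helper does not handle.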

To do that, we will create another function called dms2dd that converts each coordinate to decimal degrees.

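def dms2dd(s):
    # example: s = '40°37′30″N'

    # coordinates without seconds (e.g. '38°54′N') get '0″' inserted before the direction
    if '″' not in s:
        s = s[:-1] + '0″' + s[-1]

    # split on the degree, minute and second symbols
    degrees, minutes, seconds, direction = re.split('[°′″\"]+', s)

    dd = float(degrees) + float(minutes)/60 + float(seconds)/(60*60)

    # south and west are negative in decimal degrees
    if direction in ('S','W'):
        dd *= -1
    return dd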

We will now apply the function using pandas' apply method to latitude and longitude data:

df['Lat'] = df['Lat (DMS)'].apply(dms2dd)
df['Lon'] = df['Lon (DMS)'].apply(dms2dd)
df.head(10)

Output:

Step 4: Setting Up Folium and OpenStreetMap (OSM)

We'll begin visualizing the port locations using the Folium library with the 'OpenStreetMap (OSM)' base map. It will give us a basic map view but may present location names in Chinese, which can be a bit challenging for non-Chinese speakers.

# Tiles from OpenStreetMap (it does not offer English names for local cities, use MapBox instead)
tiles = 'OpenStreetMap'

# put geo. locations to a list 
location_list_df = df[['Lat','Lon']].values.tolist()


# Display the ports

# create basemap
map_obj = folium.Map(location=[df.Lat.mean(), df.Lon.mean()],
                    tiles=tiles,
                    zoom_start=4)

# place ports on the map
for loc in range(len(location_list_df)):
    folium.Marker(location=location_list_df[loc],
                  popup=df.iloc[loc].Port,
#                   icon = folium.Icon(color='blue', icon=f'info-sign') # if wants to use an icon
                  icon = folium.DivIcon(
                  html=('<svg height="100" width="100">'
                        '<circle cx="10" cy="10" r="5" stroke="red" stroke-width="3" fill="yellow" />'
                        f'<text x="15" y="15" font-size: 5pt fill="black">{df.iloc[loc].Port}</text>'
                        '</svg>')
                    )
                 ).add_to(map_obj)

# save the map
# map_obj.save('china_ports_osm.html')

# display the map
map_obj

Output:

Due to HTML limitations, the map pasted above is not interactive. Once you run the code, you will be able to interact with it on your computer, as shown below.

As you can see on the map, all names are in the local language, meaning that unless you read Chinese, there is no way of telling where, for instance, the city of Dalian is. Therefore, we need the names in English. At the time of this writing (May 2023), OpenStreetMap does not offer English translations for countries where English is not the official language.

Step 5: Creating a Mapbox Account

To overcome the language issue, we'll create an account on Mapbox. Mapbox offers an English base map, which will make our map more user-friendly for non-Chinese speakers.

Step 6: Generating a Mapbox Public Key

Once you have your Mapbox account set up, you will generate a public key (or use the default one) to access the Mapbox maps in your Python script. For this tutorial, I created an access token called china_ports.
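Since the token is tied to your account, you may prefer not to paste it directly into a script. One option, sketched below, is to read it from an environment variable and build the tile URL from it. The variable name MAPBOX_ACCESS_TOKEN and the helper mapbox_tile_url are assumptions made for this example; the URL pattern is the same one used in the next step.

```python
import os

def mapbox_tile_url(style_id, access_token, tile_size=256):
    """Build a tile URL template for a Mapbox style (hypothetical helper)."""
    # {z}, {x} and {y} stay as literal placeholders for the map library to fill in
    return (
        f"https://api.mapbox.com/styles/v1/mapbox/{style_id}/tiles/"
        f"{tile_size}/{{z}}/{{x}}/{{y}}@2x?access_token={access_token}"
    )

# falls back to a placeholder when the variable is not set
token = os.environ.get("MAPBOX_ACCESS_TOKEN", "<YOUR ACCESS TOKEN>")
tiles = mapbox_tile_url("streets-v12", token)
```

The resulting tiles string can be passed to folium.Map in the same way as the hard-coded version.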

Step 7: Using Mapbox's English Base Map

We'll modify our Python script to use Mapbox's map with the English base layer. This will allow us to display location names in English and enhance the user experience. The map will have interactive features (once run), allowing users to zoom in, zoom out, and click on port markers to get additional information.

# Tiles from MapBox 
mapbox_access_token = '<YOUR ACCESS TOKEN>' # example: mapbox_access_token='pk.eyJ1I...'

# Select a tile style for Mapbox map
tileset_ID_str = "streets-v12" # Mapbox Styles: https://docs.mapbox.com/api/maps/styles/
tilesize_pixels = "256"
tiles = f"https://api.mapbox.com/styles/v1/mapbox/{tileset_ID_str}/tiles/{tilesize_pixels}/{{z}}/{{x}}/{{y}}@2x?access_token={mapbox_access_token}" 

# Display the ports
# put geo. locations to a list
location_list_df = df[['Lat','Lon']].values.tolist()

# create basemap
map_obj = folium.Map(location=[df.Lat.mean(), df.Lon.mean()],
                    tiles=tiles,
                    zoom_start=4,
                    attr='Mapbox')

# place ports on the map
for loc in range(len(location_list_df)):
    folium.Marker(location=location_list_df[loc],
                  popup=df.iloc[loc].Port,
#                   icon = folium.Icon(color='blue',icon=f'info-sign')
                  icon = folium.DivIcon(
                  html=('<svg height="100" width="100">'
                        '<circle cx="10" cy="10" r="5" stroke="red" stroke-width="3" fill="yellow" />'
                        f'<text x="15" y="15" font-size: 5pt fill="black">{df.iloc[loc].Port}</text>'
                        '</svg>')
                    )
                 ).add_to(map_obj)
                
# save the map 
map_obj.save('china_ports_mbx.html')

# display the map
map_obj

Output:

Once again, due to HTML limitations, the map pasted above is not interactive. Once you run the code, you will be able to interact with it on your computer.

Conclusion

Congratulations! We've successfully created an interactive map displaying major ports in China using Python, BeautifulSoup, Pandas, and Folium with Mapbox's English base map. You can now explore and share this map with others, enhancing their understanding of the major ports in China.
