Introduction

Choropleth maps let us visualize how a variable varies across a geographic area, or show the level of variability within a region. We’re going to generate one using Python (and a few additional packages) and solve a real-world business problem along the way. Here’s what we’re going to create:

Geomap

All of the files to follow along with this tutorial can be found here.

The Ask

‘Using our attendance data, draw a map of attendance to neighbourhoods in the city for the month of December.’

This is a real-world example of a request that would come in at my organization. Of course, out of respect for data privacy and my employer, the ‘internal’ data you’ll see here has been fabricated; however, it still makes a good example, and one that could easily be adjusted to fit your needs.

Data Collection

Step one, collect the data.

We’re asked to present a map of attendance by neighbourhood in the city for the month of December. There’s just one problem: we don’t collect neighbourhoods in our point-of-sale system, nor is there a report that would give me that information. We do, however, collect postal codes (you might think of them as zip codes) and have a table, t_attendance, that links customer records to the table t_address. A quick SQL query will give us the information we need to continue. Of course, this method of data collection is unique to me. You’ll need to figure out how to pull this information yourself:

SELECT b.postal_code, COUNT(b.postal_code) AS attendance
FROM T_ATTENDANCE a
INNER JOIN T_ADDRESS b
  ON a.customer_no = b.customer_no
WHERE a.attend_dt >= '12-01-2022'
  AND a.attend_dt <= '12-31-2022'
GROUP BY b.postal_code

If you’re an SQL guru, feel free to pick apart my query. If you’re new to SQL, let me explain:

If I select everything (SELECT * FROM…), I get more than 50 columns and over 200,000 lines of data, most of which I don’t need. If the ask were to filter out individuals who are members, or to exclude a specific customer group, I could include more parameters, or export the entire set and filter my data in Excel. The above query gives me exactly what I’m looking for to proceed: grouped postal codes and the total attendance for each, using only two columns and just over 11,000 rows. Export this to a CSV, which I’ve named DecemberAttendance.csv.
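If you can’t run SQL against your point-of-sale database, the same join-and-aggregate step can be sketched in pandas. The toy data and column names below mirror the query above and are purely illustrative:

```python
import pandas as pd

# Toy stand-ins for T_ATTENDANCE and T_ADDRESS; column names mirror the SQL query
attendance = pd.DataFrame({
    "customer_no": [1, 1, 2, 3],
    "attend_dt": pd.to_datetime(["2022-12-05", "2022-12-20", "2022-12-12", "2022-11-30"]),
})
address = pd.DataFrame({
    "customer_no": [1, 2, 3],
    "postal_code": ["T2P 1J9", "T2P 1J9", "T3A 0H6"],
})

# Filter to December, join on customer_no, then count rows per postal code
december = attendance[
    (attendance["attend_dt"] >= "2022-12-01") & (attendance["attend_dt"] <= "2022-12-31")
]
counts = (
    december.merge(address, on="customer_no")
    .groupby("postal_code")
    .size()
    .reset_index(name="attendance")
)
print(counts)
```

The customer who last attended in November drops out at the filter step, exactly as the WHERE clause would exclude them.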

SQL data

Next, we obtain a GeoJSON file of our city. GeoJSON is a format for encoding a variety of geographic data structures, like so:

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "name": "Dinagat Islands"
  }
}

It can take thousands of these to draw a line around an area: each pair in ‘coordinates’ becomes one point on the map, and even a relatively small, square-ish shape can take hundreds of points. For this reason, I encourage you to Google your area and find a pre-made GeoJSON file to work with.
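To get a feel for how heavy a file is, you can count the points that outline each feature. The snippet below uses a toy Polygon, but a real city file has the same structure:

```python
import json

# A toy FeatureCollection; real city GeoJSON files follow the same shape,
# just with far more features and far more coordinate pairs per ring
city = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]],
            },
            "properties": {"name": "Square-ish"},
        }
    ],
}

# Count the points in each feature's outer ring
for feature in city["features"]:
    ring = feature["geometry"]["coordinates"][0]
    print(feature["properties"]["name"], len(ring))
```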

GeoJSON example

Clean the Data

These ‘points’ in my GeoJSON relate to neighbourhoods. Great! I have neighbourhoods, but no postal codes. What’s the link between the two? Latitude and longitude. I can VLOOKUP between the two in Excel, relating my postal codes to lat and lon, and therefore to neighbourhood via Python. This step uses a resource found online that associates latitude and longitude with postal codes. Again, Google is your friend here.
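If you’d rather skip Excel, the VLOOKUP step can also be done with a pandas merge. The column names and values below are illustrative stand-ins for the attendance export and the postal-code lookup resource:

```python
import pandas as pd

# Hypothetical lookup resource: postal code -> latitude/longitude
postal_latlon = pd.DataFrame({
    "postal_code": ["T2P 1J9", "T3A 0H6"],
    "lat": [51.045, 51.124],
    "lon": [-114.063, -114.153],
})

# Hypothetical attendance export, as produced by the SQL query
attendance = pd.DataFrame({
    "postal_code": ["T2P 1J9", "T3A 0H6"],
    "attendance": [42, 7],
})

# Left join: every attendance row picks up its lat/lon, like a VLOOKUP
merged = attendance.merge(postal_latlon, on="postal_code", how="left")
print(merged)
```

A left join keeps every attendance row even if a postal code is missing from the lookup table, which makes gaps easy to spot.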

Cleaned Data

Great, we have our GeoJSON file, and the file that contains latitude, longitude, and attendance. Let’s link the two using Python.

First, let’s start our Python file with the required imports. If you don’t have any of the modules installed, simply use:

pip install module_name_here

to install it.

import json
import folium
import os
import pandas as pd
from math import cos, asin, sqrt

Next, we’ll load our GeoJSON file as well as our CSV file:

geojson = os.path.join("./calgary.json")
attendance = os.path.join("./DecemberAttendance.csv")

Open our GeoJSON file for use in pandas:

with open(geojson) as data_file:   
    data = json.load(data_file)

Convert the opened file to a Pandas dataframe:

df = pd.json_normalize(data, 'features', ['properties',], errors='ignore', record_prefix='locations_')

Great, our file is starting to look a little more readable. We have neighbourhood and a nested list of coordinates:

Cleaned Data

Grab just the locations (neighbourhoods) and coordinates from the GeoJSON file removing much of the unneeded noise:

clean_df = df[['locations_properties.name', 'locations_geometry.coordinates']]
clean_df_dict = clean_df.set_index('locations_properties.name').T.to_dict('list')

Instead of a nested list, I want a dataframe with one row per coordinate, each associated with its neighbourhood. Build up a list of small dataframes as we loop, then concatenate them (DataFrame.append was removed in pandas 2.0, so we use pd.concat):

frames = []
for counter, name in enumerate(clean_df['locations_properties.name']):
    data = pd.DataFrame(clean_df['locations_geometry.coordinates'][counter][0][0], columns=['lon', 'lat'])
    data['location'] = name
    frames.append(data)
new_df = pd.concat(frames, ignore_index=True)

Separate these out into a new dictionary, as well as a list of just neighbourhood names:

new_df = new_df[['lat', 'lon', 'location']]
new_dictionary = new_df.to_dict('records')
new_dict_list = new_df['location'].tolist()

Using our attendance file, convert to a dataframe and then a dictionary:

fields = ['lat', 'lon', 'attendance']
attendance_df = pd.read_csv(attendance, usecols=fields)
attendance_df["lon"] = pd.to_numeric(attendance_df["lon"])
attendance_dict = attendance_df.to_dict('records')

The Haversine Formula

Okay, we’ve got our attendance dataframe and our GeoJSON dataframe in matching formats. Time to link them together using the haversine formula. What’s that, you ask? The haversine formula is a very accurate way of computing the distance between two points on the surface of a sphere using the latitude and longitude of the two points. More info here. First, we build a pair of functions to apply the formula to each row in our attendance_dict:

def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295  # pi / 180, to convert degrees to radians
    hav = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(hav))  # 12742 km is the Earth's diameter

def closest(data, v):
    # Return the record in data whose lat/lon is nearest to point v
    return min(data, key=lambda p: distance(v['lat'], v['lon'], p['lat'], p['lon']))
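As a quick sanity check (repeating the function here so the snippet stands alone), the result comes back in kilometres; two points one degree of latitude apart should come out near 111 km:

```python
from math import cos, asin, sqrt

def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295  # pi / 180, to convert degrees to radians
    hav = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(hav))  # 12742 km is the Earth's diameter

# One degree of latitude is roughly 111 km anywhere on Earth
km = distance(51.0, -114.0, 52.0, -114.0)
print(round(km, 1))
```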

Define a temporary list, then find the closest neighbourhood for each attendance record:

tempDataList = new_dictionary

places = [closest(tempDataList, x)['location'] for x in attendance_dict]

Finally, insert as a new column in our attendance_df dataframe:

attendance_df['location']= places

Heads up, this can take some time to run depending on the size of your dataset. I tend to average about 1,000 seconds, or almost 17 minutes to run this section of the script. Once that’s done, however, we have finally linked postal code to latitude and longitude and then to neighbourhood!
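If 17 minutes is too long for your taste, one common speedup (not what I use above, just a sketch assuming NumPy is available) is to vectorize the haversine so each record is compared against every neighbourhood in a single array operation. The coordinates and names below are toy values:

```python
import numpy as np

# Toy neighbourhood points (in practice, every coordinate from new_df)
hood_lat = np.array([51.00, 51.10, 51.20])
hood_lon = np.array([-114.00, -114.10, -114.20])
hood_names = np.array(["A", "B", "C"])

# Toy customer points (in practice, the rows of attendance_df)
cust_lat = np.array([51.01, 51.19])
cust_lon = np.array([-114.01, -114.19])

p = np.pi / 180.0
# Broadcast to a (customers x neighbourhoods) distance matrix
dlat = hood_lat[None, :] - cust_lat[:, None]
dlon = hood_lon[None, :] - cust_lon[:, None]
hav = (0.5 - np.cos(dlat * p) / 2
       + np.cos(cust_lat[:, None] * p) * np.cos(hood_lat[None, :] * p)
       * (1 - np.cos(dlon * p)) / 2)
dist = 12742 * np.arcsin(np.sqrt(hav))

# Nearest neighbourhood for every customer, all at once
nearest = hood_names[dist.argmin(axis=1)]
print(nearest)
```

Deduplicating postal codes before the lookup (there are far fewer distinct postal codes than attendance rows) also cuts the runtime substantially.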

Cleaned Data

Just one small step of cleanup before we have a dataframe ready to plot: handling the neighbourhoods with 0 attendance, which therefore don’t appear in our attendance_df dataframe.

For this, we’ll create a new list containing the set difference, i.e. the neighbourhoods that exist in the GeoJSON (new_dict_list) but don’t exist in places:

to_zero = set(new_dict_list).difference(places)
to_zero = list(to_zero)

We’ll establish this as a new dataframe, with 0 for latitude, longitude, and attendance (set the location column first so the rows exist before the scalar columns are assigned):

to_zero_df = pd.DataFrame({'location': to_zero})
to_zero_df['lat'] = 0
to_zero_df['lon'] = 0
to_zero_df['attendance'] = 0

And finally, create one last dataframe that combines both attendance_df and our zeroed dataframes:

total_attendance_df = pd.concat([attendance_df, to_zero_df])

We no longer care about latitude and longitude here, so group by location and sum the attendance (reset_index keeps the result as a dataframe with the two columns folium expects):

plot_df = total_attendance_df.groupby("location")["attendance"].sum().reset_index()

Plotting

Here you’ll want to find the center of your city. On Google, the center of my city is at 51.0447, -114.0719. We’ll start a folium map centered at that location and, through playing with values, start the zoom level at 10.

geo_map = folium.Map(location=[51.0447, -114.0719], zoom_start=10)

Now the settings for the map:

folium.Choropleth(geo_data=geojson,
    data = plot_df,
    columns=['location','attendance'],
    key_on='feature.properties.name',
    fill_color='YlGnBu',
    fill_opacity=0.8,
    line_opacity=0.7,
    legend_name='Q4 2022 Attendance').add_to(geo_map)
folium.TileLayer('cartodbpositron').add_to(geo_map)

What are these?

  • data: the dataset being plotted
  • columns: the columns we want to use
  • key_on: the matching key in our GeoJSON file. In this example it’s feature.properties.name; yours might be different depending on how the JSON file was built.
  • fill_color: the colour scheme of the map. You can choose between ‘BuGn’, ‘BuPu’, ‘GnBu’, ‘OrRd’, ‘PuBu’, ‘PuBuGn’, ‘PuRd’, ‘RdPu’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, and ‘YlOrRd’.
  • fill_opacity: how transparent the coloured blocks are. Play with this to suit your needs.
  • line_opacity: how transparent the lines around your areas are.
  • legend_name: the name that appears under the legend.
  • folium.TileLayer: the appearance of the base map. Options here. They all have very different aesthetics.

Finally, call the map. In a notebook this renders it inline; from a plain script, use geo_map.save('map.html') instead and open the file in your browser:

geo_map

Final Map

Here you can zoom in and out to your liking. Looks like it’s time to up our marketing game in the N.E. area of the city. We can start a proposal for a campaign based on this one map, track its success over a set period of time, and see what’s working so well in Fish Creek Park.

That concludes a somewhat unique approach to plotting a map in Python. Your case may be different, requiring much less data linking and cleanup.