Mapping Air Pollution with leaflet

Alice Tivarovsky 2022-08-25 241 minute read

Motivation

These days, it seems like there is an ever-growing list of massive, complicated issues to worry about - between climate change, nuclear war, mega-droughts, wildfires. Reading the news feels like an exercise in adding to that list.

But sometimes, the news gives us a bright spot. My home state of California - a place, which was once home to some of the worst air pollution in the world, recently passed a resolution to eliminate gasoline-powered cars by 2035. I was modestly excited by this, but I was more interested in the current air pollution situation. Since I don’t think about air pollution all that much, I ended up in a rabbit-hole of unstructured research. How bad is air pollution now? How does my state, city, and county compare to others in the US? Is air pollution getting better or worse?

I will attempt to answer some of those questions below. Of course, others have already done that, and probably in a more compelling manner. But I learned a lot going through the process myself, and if you want to learn a bit about geospatial data andleaflet, I think an exercise like this is highly useful. For a thorough tutorial on the capabilities of leaflet in R, please refer to the official tutorial here.

Background

Globally, over 6 million people die each year as a result of air pollution. It’s not something we think much about in the United States or other parts of the developed world (aside from, maybe, complaining about the smog in Los Angeles), but it’s a leading cause of death in parts of Africa and Asia. And unfortunately, it seems to be getting worse in other parts of the world.

Although progress has been made through better policy and environmental regulation, this magnitude of premature deaths should surely attract more attention than it does now. Some have posed plausible reasons for the public’s lack of interest (mainly that it’s an “over there” problem). Others have scolded countries for not doing more to protect public health. Either way, the health effects of air pollution require more research and attention.

Data Preparation

library(tidyverse)
library(readxl)
library(leaflet)
library(rgdal) # Geospatial data abstraction library
library(stats)
library(sp)
library(RColorBrewer)

In the United States, the EPA maintains tens of thousands of outdoor air quality monitors and provides data to the public through several sources, including the AirData website. First, we’ll practice mapping skills by pinning all these monitors on a map. The geo-locations for all active monitors were downloaded here. We read in the data, call leaflet, add tiles, and center on a point in the middle of the country.

sites <- read.csv("./data/aqs_sites 3.csv", header = TRUE)

# Choose coordinates in middle of country
naked_map <- 
  leaflet() %>% 
  setView(lng = -95, lat = 40, zoom = 4)  %>% 
  addTiles() 

naked_map


It never fails to amaze me how little code it takes to make a map. Next, we’ll pass the sites object (EPA refers to the location monitors as sites) to leaflet and visualize them as markers on the map.

sites_map <- 
  leaflet(sites) %>% 
  setView(lng = -95, lat = 40, zoom = 4)  %>% 
  addTiles() %>% 
  addCircleMarkers(lat = ~Latitude, 
                   lng = ~Longitude, 
                   radius = 1)

sites_map


Most states look like blue blobs, but we can zoom in and find the ones in our neighborhood (unless, of course, we live in the middle of Nevada).

Analysis

Now let’s get a sense of what those monitors are measuring.

As someone not well-versed in environmental pollutants, figuring out what dataset and which measures to use was a bit complicated. Two common measures are PM 2.5 (particulate matter 2.5 micrometers or smaller in diameter) and ground level ozone.

But a more general metric, one which combines these and other toxic pollutants, is the Air Quality Index (AQI). Even if you’re not terribly concerned about air pollution, you’ve probably seen AQI measures before, possibly through your weather app telling you the air quality is bad so close your windows and don’t even think about leaving your house. Or perhaps you’ve seen a local alert or warning to that effect. 1

Given its ubiquity, and the fact that it’s a more general metric than PM 2.5, ground level ozone, or any other specific particulate, I decided to look at AQI. More specifically, we’ll be looking at annual median AQI by county. State-level measures are too large for air pollution data, and city level is likely too small. I’ll be comparing 2021 data to 2013 (the years were chosen arbitrarily). We’ll load the data and look at a snapshot.

# annual 'concentration by monitor' dataset for 2013
full_2013 <- 
  read.csv("./data/annual_aqi_by_county_2013.csv", header = TRUE) %>% 
  janitor::clean_names()

head(full_2013)
##     state  county year days_with_aqi good_days moderate_days
## 1 Alabama Baldwin 2013           273       235            38
## 2 Alabama    Clay 2013           118       100            18
## 3 Alabama Colbert 2013           285       252            33
## 4 Alabama  DeKalb 2013           360       319            41
## 5 Alabama  Elmore 2013           244       229            15
## 6 Alabama  Etowah 2013           281       244            37
##   unhealthy_for_sensitive_groups_days unhealthy_days very_unhealthy_days
## 1                                   0              0                   0
## 2                                   0              0                   0
## 3                                   0              0                   0
## 4                                   0              0                   0
## 5                                   0              0                   0
## 6                                   0              0                   0
##   hazardous_days max_aqi x90th_percentile_aqi median_aqi days_co days_no2
## 1              0      87                   54         36       0        0
## 2              0      65                   52         32       0        0
## 3              0      80                   51         38       0        0
## 4              0      93                   52         38       0        0
## 5              0      87                   48         36       0        0
## 6              0      90                   52         37       0        0
##   days_ozone days_pm2_5 days_pm10
## 1        200         73         0
## 2          0        118         0
## 3        202         83         0
## 4        305         55         0
## 5        244          0         0
## 6        197         84         0
# annual 'concentration by monitor' dataset for 2021
full_2021 <- 
  read.csv("./data/annual_aqi_by_county_2021.csv", header = TRUE) %>% 
  janitor::clean_names() 

head(full_2021)
##     state    county year days_with_aqi good_days moderate_days
## 1 Alabama   Baldwin 2021           280       264            16
## 2 Alabama      Clay 2021           110       101             9
## 3 Alabama    DeKalb 2021           361       331            30
## 4 Alabama    Elmore 2021           241       238             3
## 5 Alabama    Etowah 2021           286       253            33
## 6 Alabama Jefferson 2021           365       174           186
##   unhealthy_for_sensitive_groups_days unhealthy_days very_unhealthy_days
## 1                                   0              0                   0
## 2                                   0              0                   0
## 3                                   0              0                   0
## 4                                   0              0                   0
## 5                                   0              0                   0
## 6                                   3              2                   0
##   hazardous_days max_aqi x90th_percentile_aqi median_aqi days_co days_no2
## 1              0      61                   48         35       0        0
## 2              0      67                   47         27       0        0
## 3              0      84                   49         36       0        0
## 4              0      87                   44         32       0        0
## 5              0      77                   51         36       0        0
## 6              0     179                   77         51       0        1
##   days_ozone days_pm2_5 days_pm10
## 1        221         59         0
## 2          0        110         0
## 3        320         41         0
## 4        241          0         0
## 5        213         73         0
## 6         83        280         1

Now we need to figure out how to present the AQI data visually - one way might be to color the marker representing each monitor in accordance with the level of pollution it measured for some time interval. But given the high concentration of dots, it might be hard to read. Instead, let’s try to divide the United States by county lines. The intent of this approach is to mimic what the Washington Post used in this 2019 article, where it reported an excess of 10,000 deaths due to air pollution over a 2-year period.

To do this, we’ll need county shapefiles- geo-spatial boundary files that will trace the shape of each county - which will comprise a layer over our bare leaflet map. The Census Bureau maintains shapefiles in accordance with current county boundaries, and the ones used below were found here.

Reading these shapefiles into R requires the rgdal package and readOGR function.

# Source for shapefiles: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
counties <- 
  readOGR("cb_2018_us_county_5m", layer = "cb_2018_us_county_5m", GDAL1_integer64_policy = TRUE)
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/AliceTivarovsky/DataScience/my_website/content/blog/2022-08-25-leaflet_pollution/cb_2018_us_county_5m", layer: "cb_2018_us_county_5m"
## with 3233 features
## It has 9 fields
## Integer64 fields read as doubles:  ALAND AWATER
counties_map <- 
  leaflet(counties) %>% 
  setView(lng = -95, lat = 40, zoom = 4)  %>% 
  addTiles() %>% 
  addPolygons(weight = 0.1, smoothFactor = 0.5,  opacity = 1.0, fillOpacity = 0.2)

counties_map


The shapefiles are working as expected - we can see well-defined state and county borders, overlaid nicely on top of the map. But what we want is a map, like the one above, but color-coded according to the value of the selected parameter - in our case, the median annual AQI. In map-speak, this is called a chloropleth, and in order to make one, we need to give leaflet the following things: - A color palette - A numeric vector defining “bins” (i.e cutoff points at which one color in the palette will change to the next)

Since we don’t yet know the range of values for median AQI, let’s run some descriptive statistics on it.

# 2013 data
range(full_2013$median_aqi)
## [1]  4 89
quantile(full_2013$median_aqi)
##   0%  25%  50%  75% 100% 
##    4   34   38   43   89
# 2021 data
range(full_2021$median_aqi)
## [1]   3 122
quantile(full_2021$median_aqi)
##   0%  25%  50%  75% 100% 
##    3   33   37   41  122

Without any calculations or data visualization, we can already see that the 2021 maximum is much higher than 2013. But let’s press on, now that we have a general sense of the spread of AQI. Note that the color palette was chosen arbitrarily from RColorBrewer.

# 2013
bins_13 <- c(0, 20, 40, 60, 80, 100, 120, 140)
pal_13 <- colorBin("Reds", domain = full_2013$median_aqi, bins = bins_13)

# 2021
bins_21 <-  c(0, 20, 40, 60, 80, 100, 120, 140)
pal_21 <- colorBin("Reds", domain = full_2021$median_aqi, bins = bins_21)

And now, we get to the somewhat tricky part of the process. We need to combine the shapefile object counties with the 2013 and 2021 air pollution datasets stored in full_2013 and full_2021. This seems impossible since we’re combining a geospatial object with a standard dataset, but it turns out to be pretty simple with the merge function in the sp package:

# 2013 merge
full_2013_1 <- 
  full_2013 %>% 
  select(county, median_aqi) %>% 
  rename("NAME" = county)

counties_13 <- merge(counties, full_2013_1, by = "NAME", duplicateGeoms = TRUE)

# 2021 merge
full_2021_1 <- 
  full_2021 %>% 
  select(county, median_aqi) %>% 
  rename("NAME" = county) 

counties_21 <- merge(counties, full_2021_1, by = "NAME", duplicateGeoms = TRUE)

Finally, we’re ready to map the data. Notice that we’re using a smaller value in the weight argument of addPolygons than we did above. This is to make the boundaries seamless and minimize visual clutter now that the counties are presented with different colors. We are also using a pop-up label, so that the user can see the name of the county corresponding to the polygon.

map_2013 <- 
  leaflet(counties_13) %>% 
  setView(lng = -95, lat = 40, zoom = 4)  %>% 
  addTiles() %>% 
  addPolygons(weight = 0.05, fillOpacity = 0.8, fillColor = ~pal_13(counties_13$median_aqi), popup = ~paste(counties_13$NAME, "<br>", "Median Annual AQI:", counties_13$median_aqi))

map_2013
map_2021 <- 
  leaflet(counties_21) %>% 
  setView(lng = -95, lat = 40, zoom = 4)  %>% 
  addTiles() %>% 
  addPolygons(weight = 0.01, fillOpacity = 0.8, fillColor = ~pal_21(counties_21$median_aqi), popup = ~paste(counties_21$NAME, "<br>", "Median Annual AQI:", counties_21$median_aqi))

map_2021


There’s definitely quite a bit more red on the 2021 map. How much more is hard to quantify just eyeballing, so we’d want a good statistical test if we were to go any further. What is useful, though, is in-map comparison across counties.

Conclusion

I hope this walk-through inspired you to play with leaflet on your own. I said it before, but I never cease to be amazed by the capabilities of interactive packages in R, and leaflet is as nice as they get. Please don’t hesitate to contact me if you have any questions or suggestions. Happy mapping!


  1. Here in Los Angeles, we see them frequently during fire season.↩︎