I'm the Map
Introduction
In a nutshell, choropleth comes from the Greek choros (area) and pleth (value). They distribute colors the map regions based on the measure which then allows for comparisons using the relative value associated with each color of one region to another.
Preliminary Items
First Things First! Set your Working Directory
Your working directory is simply where your script will look for anything it needs like external data sets. There are a few ways to go about doing this which we will cover. However for now, just do the following:
- Open up a new script by going to
File > New File > R Script
. - Save it in a preferably empty folder as whatever you want.
- Go to the menu bar and select
Session > Set Working Directory > To Source File Location
.
Download the script
Copying and pasting syntax from a browser can cause problems. To avoid this issue, please download a script with all of the needed code presented in this walkthrough.
Loading libraries
Go ahead and load these or install and then load them.
library(tidyverse)
library(tidycensus)
library(viridis)
library(tools)
Setting up your API
As before, you will utilize the Census API. You’ll likely receive a prompt to save it using
To install your API key for use in future sessions, run this function with install = TRUE.
I suggest rerunning the command and do just that!
# Input your Census key
census_api_key("YOUR CENSUS API KEY")
# Reload the .Renviron marker
readRenviron("~/.Renviron")
Walkthrough
There are two major functions implemented in tidycensus::get_decennial()
, which grants access to the 1990, 2000, and 2010 decennial US Census APIs, and get_acs
, which grants access to the 5-year American Community Survey APIs. While limited in comparison to the other method we used, what it lacks in breadth, it soon makes up in depth which we will see soon!
Getting variables from the Census or ACS requires knowing the variable ID - and there are thousands of these IDs across the different Census files. To search for variables without pulling your hair out, use the load_variables
function. The function takes two required arguments:
the year of the Census or endyear of the ACS sample, and
the dataset - one of
sf1
,sf3
, oracs5
.
For ideal functionality, it is recommended that you assign the result of this function to a variable, setting cache = TRUE
to store the result on your computer for future access, and using the View
function in RStudio to interactively browse for variables. However, this is not required. However, as an example, we will take the 2016 American Community Survey data set and save it.
acs16 <- load_variables(2016, "acs5", cache = TRUE)
head(acs16) # or View(acs16) if you prefer
## # A tibble: 6 x 3
## name label concept
## <chr> <chr> <chr>
## 1 B00001_001 Estimate!!Total UNWEIGHTED SAMPLE COUNT OF THE POP…
## 2 B00002_001 Estimate!!Total UNWEIGHTED SAMPLE HOUSING UNITS
## 3 B01001_001 Estimate!!Total SEX BY AGE
## 4 B01001_002 Estimate!!Total!!Male SEX BY AGE
## 5 B01001_003 Estimate!!Total!!Male!!Under 5… SEX BY AGE
## 6 B01001_004 Estimate!!Total!!Male!!5 to 9 … SEX BY AGE
In this basic example, let’s look at US income by county using the data set B19013_001
(such a clear and obvious name)
title <- acs16 %>% # use the acs_16 data set we created
filter(name == "B19013_001") %>% # filter everything else out except where we find B19013_001
select(concept) # select the column concept that provides a very brief description
# Convert to title case. We have to go in a funny way simply due to the encoding of text within the Census API
title <- tolower(title) %>%
toTitleCase()
title
## [1] "Median Household Income in the Past 12 Months (in 2016 Inflation-Adjusted Dollars)"
Then let’s get some income level data from that set…
us_county_income <- get_acs(geography = "county",
variables = "B19013_001",
shift_geo = TRUE,
geometry = TRUE)
## Getting data from the 2015-2019 5-year ACS
## Using feature geometry obtained from the albersusa package
## Please note: Alaska and Hawaii are being shifted and are not to scale.
…and plot!
ggplot(us_county_income) +
geom_sf(aes(fill = estimate), color = NA) +
# Removes the graticule lines from the plot
coord_sf(datum = NA) +
theme_minimal() +
scale_fill_viridis()
Looks good. ow let’s get some West Virginia population data like we did earlier in the term.
wv_pop <- get_acs(geography = "county",
variables = "B01003_001",
state = "WV",
geometry = TRUE)
## Getting data from the 2015-2019 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
##
|
| | 0%
|
|=== | 4%
|
|=== | 5%
|
|==== | 5%
|
|==== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|======= | 9%
|
|======= | 10%
|
|======= | 11%
|
|======== | 11%
|
|======== | 12%
|
|========= | 12%
|
|========= | 13%
|
|========= | 14%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 15%
|
|=========== | 16%
|
|============ | 16%
|
|============ | 17%
|
|============ | 18%
|
|============= | 18%
|
|============= | 19%
|
|============== | 19%
|
|============== | 20%
|
|============== | 21%
|
|=============== | 21%
|
|=============== | 22%
|
|================ | 22%
|
|================ | 23%
|
|================= | 24%
|
|================= | 25%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 27%
|
|=================== | 28%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 29%
|
|===================== | 30%
|
|===================== | 31%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 32%
|
|======================= | 33%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 35%
|
|========================= | 36%
|
|========================== | 37%
|
|========================== | 38%
|
|=========================== | 38%
|
|=========================== | 39%
|
|============================ | 39%
|
|============================ | 40%
|
|============================ | 41%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 42%
|
|============================== | 43%
|
|============================== | 44%
|
|=============================== | 44%
|
|=============================== | 45%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 47%
|
|================================= | 48%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 49%
|
|=================================== | 50%
|
|=================================== | 51%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 52%
|
|===================================== | 53%
|
|===================================== | 54%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 55%
|
|======================================= | 56%
|
|======================================== | 57%
|
|======================================== | 58%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 59%
|
|========================================== | 60%
|
|========================================== | 61%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 62%
|
|============================================ | 63%
|
|============================================= | 64%
|
|============================================= | 65%
|
|============================================== | 65%
|
|============================================== | 66%
|
|=============================================== | 67%
|
|=============================================== | 68%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 69%
|
|================================================= | 70%
|
|================================================= | 71%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 72%
|
|=================================================== | 73%
|
|==================================================== | 74%
|
|==================================================== | 75%
|
|===================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 78%
|
|======================================================= | 79%
|
|======================================================== | 79%
|
|======================================================== | 80%
|
|======================================================== | 81%
|
|========================================================= | 81%
|
|========================================================= | 82%
|
|========================================================== | 82%
|
|========================================================== | 83%
|
|=========================================================== | 84%
|
|=========================================================== | 85%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 86%
|
|============================================================= | 87%
|
|============================================================= | 88%
|
|============================================================== | 88%
|
|============================================================== | 89%
|
|=============================================================== | 89%
|
|=============================================================== | 90%
|
|=============================================================== | 91%
|
|================================================================ | 91%
|
|================================================================ | 92%
|
|================================================================= | 92%
|
|================================================================= | 93%
|
|================================================================= | 94%
|
|================================================================== | 94%
|
|================================================================== | 95%
|
|=================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 97%
|
|==================================================================== | 98%
|
|===================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 99%
|
|======================================================================| 100%
head(wv_pop) # or View(wv_pop) if you prefer
## Simple feature collection with 6 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -82.64474 ymin: 37.84242 xmax: -80.16639 ymax: 39.72189
## geographic CRS: NAD83
## GEOID NAME variable estimate moe
## 1 54039 Kanawha County, West Virginia B01003_001 183279 NA
## 2 54007 Braxton County, West Virginia B01003_001 14190 NA
## 3 54033 Harrison County, West Virginia B01003_001 67908 NA
## 4 54085 Ritchie County, West Virginia B01003_001 9844 NA
## 5 54103 Wetzel County, West Virginia B01003_001 15436 NA
## 6 54099 Wayne County, West Virginia B01003_001 40303 NA
## geometry
## 1 MULTIPOLYGON (((-81.90454 3...
## 2 MULTIPOLYGON (((-80.98495 3...
## 3 MULTIPOLYGON (((-80.60255 3...
## 4 MULTIPOLYGON (((-81.32879 3...
## 5 MULTIPOLYGON (((-80.94378 3...
## 6 MULTIPOLYGON (((-82.6432 38...
So far we haven’t done anything out of left field, in that it the process and results are pretty similar to what came before. It is at this point you should note that there is something much more interesting coming. Let’s redo the plot above using a very cool package called leaflet
. Let’s lod (or install and load) some new libraries:
library(leaflet) # a JavaScript library for interactive maps
library(stringr) # Well I lied...this isn't new!
library(sf) # A very nice package that allows us to us a standardized way to encode spatial vector data, aka spatial objects.
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
I cheated a little and know how many colors we’ll need (10). Below is one of many ways to construct a continuous scale.
pal <- colorQuantile(palette = "viridis", domain = wv_pop$estimate, n = 10)
Now it gets interesting (well for me anyways…yes I find maps fun). Below we’re going to construct an interactive plot. You can click on the individual counties to see their names:
wv_pop %>%
st_transform(crs = "+init=epsg:4326") %>%
leaflet(width = "100%") %>%
addProviderTiles(provider = "CartoDB.Positron") %>%
addPolygons(popup = ~ str_extract(NAME, "^([^,]*)"),
stroke = FALSE,
smoothFactor = 0,
fillOpacity = 0.7,
color = ~ pal(estimate)) %>%
addLegend("bottomright",
pal = pal,
values = ~ estimate,
title = "Population percentiles",
opacity = 1)
## Warning in CPL_crs_from_input(x): GDAL Message 1: +init=epsg:XXXX syntax is
## deprecated. It might return a CRS with a non-EPSG compliant axis order.
Want an explanation of the code? If so, keep reading. If you have fallen asleep, maybe move on to the next batch of codes.
Note about geographic information system (GIS)
What the heck is st_transform(crs = "+init=epsg:4326")
? Well without getting too into GIS because that is an entire semesters worth of material in itself, the epsg
part specifically refers to an optimal coordinate system first developed by the European Petroleum Survey Group, an organization involved in best practices for surveying and applied geodesy. The EPSG was absorbed into the International Association of Oil & Gas Producers in 2005 but the code epsg
has become a standard.
If you want to know more about GIS in general, let me know as I am considering teaching a section. Again, maps are fun! Inn the meantime, to know more about GIS in general, consider checking out the ersi homepage: https://www.esri.com/en-us/what-is-gis/overview. They make the most popular software package for GIS called ArcGIS and thought it is expensive (and I mean expensive) in the outside world, much like Tableau, it is free for students and educators alike. Moreover, companies and organizations that use ArcGIS use it heavily and are more likely than not well aware of the investment. Moreover, mapping geometries (and by this I mean anything from GPS to LiDAR to measuring distances on your phone and more) is a lucrative area to be in. Take a look at this recent report by National Geographic on how wrong we were about the size and population of the Mayan people which absolutely blew my mind: https://news.nationalgeographic.com/2018/02/maya-laser-lidar-guatemala-pacunam/. That is simply one amazing facet of GIS!
OK what were we doing? Oh yeah. OK back to our maps.
Let’s now look at one of our border states - Pennsylvania. In this case, we are going to use data regarding Pennsylvania’s population by county and proportional estimates by pulling from the ACS data set B01003_001
.
# Use the acs_16 data set we created
acs16 %>%
# filter everything else out except where we find B01003_001
filter(name == "B01003_001") %>%
# select the column concept that provides a very brief description
select(concept)
## # A tibble: 1 x 1
## concept
## <chr>
## 1 TOTAL POPULATION
# Pretty much the exact same pull we did above...
pa_pop <- get_acs(geography = "county",
variables = "B01003_001",
state = "PA",
geometry = TRUE)
## Getting data from the 2015-2019 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
# ... but this time we'll color it differently.
pal <- colorNumeric(palette = "plasma",
domain = pa_pop$estimate)
# GIS things. Look at the code. Most of it is repetitive and for the most part, I think you should be able to figure out what's generally going on.
pa_pop %>%
st_transform(crs = "+init=epsg:4326") %>%
leaflet(width = "100%") %>%
addProviderTiles(provider = "CartoDB.Positron") %>%
addPolygons(popup = ~ str_extract(NAME, "^([^,]*)"),
stroke = FALSE,
smoothFactor = 0,
fillOpacity = 0.7,
color = ~ pal(estimate)) %>%
addLegend("bottomright",
pal = pal,
values = ~ estimate,
title = "County Populations",
opacity = 1)
Well the output makes sense. PA is a pretty rural state barring areas close to Washington DC and just Pittsburgh. Let’s compare it to an a state like Florida that is overtly populated:
# Again, use the acs_16 data set we created
fl_pop <- get_acs(geography = "county",
variables = "B01003_001",
state = "FL",
geometry = TRUE)
## Getting data from the 2015-2019 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
# Assign out color palette for comparison
pal <- colorNumeric(palette = "plasma",
domain = fl_pop$estimate)
# GIS things again. Notice that the base code is consistent.
fl_pop %>%
st_transform(crs = "+init=epsg:4326") %>%
leaflet(width = "100%") %>%
addProviderTiles(provider = "CartoDB.Positron") %>%
addPolygons(popup = ~ str_extract(NAME, "^([^,]*)"),
stroke = FALSE,
smoothFactor = 0,
fillOpacity = 0.7,
color = ~ pal(estimate)) %>%
addLegend("bottomright",
pal = pal,
values = ~ estimate,
title = "County Populations",
opacity = 1)