SVI Calculation for Customized Boundaries with findSVI package

An example using US commuting zones for 2020

Author

Heli Xu

Published

April 29, 2024

Here we are using commuting zones as an example to demonstrate the new feature in findSVI r package to support user-defined boundaries for SVI calculation.

The current CRAN version of findSVI package supports all Census geographies, and census data retrieval and SVI calculation are performed at the same geographic level. For users that would like to calculate SVI at additional geographic levels in which the Census geographies are fully nested, we are incorporating a new function get_svi_x() to calculate SVI for customized boundaries.

First we are installing the package via github customized-boundaries branch and loading the packages needed.

Code
devtools::install_github("heli-xu/findSVI@customized-boundaries")
Code
library(dplyr)
library(findSVI)
library(sf)
library(reactable)
library(ggplot2)
library(cowplot)
library(stringr)
library(readr)
library(tidyr)

Quick Rundown

The general workflow of SVI calculation at a customized geographic level:

  • Input to supply: a year and a customized geogrpahic level of interest, and a crosswalk between a Census geographic level and the customized level.

  • Retrieve ACS data at a Census geographic level using get_census_data() with exp = TRUE.

  • Supply the data and crosswalk in get_svi_x() to aggregate data from the Census geographic level to the customized level and calculate SVI.

For example, to calculate SVI with geometries for US commuting zones (that are consisted of counties) for 2020, we can use the following:

Code
data_geo <- get_census_data(
  year = 2020, 
  geography = "county",
  geometry = TRUE,
  exp = TRUE
)
Code
svi_geo <- get_svi_x(
  year = 2020,
  data = data_geo,
  xwalk = cty_cz_2020_xwalk #county-commuting zone crosswalk
)

To visualize the overall SVI for each CZ in the US (the higher the SVI, the more vulnerable a community is considered):

Code
continent <- ggplot()+ 
  geom_sf(data = svi_geo, aes(fill = RPL_themes))+
  coord_sf(xlim = c(-130, -60), ylim = c(23, 50)) +  
  # Adjust xlim and ylim to focus on continental US
  scale_fill_viridis_c(option = "inferno", direction = -1) +
  labs(
    title = "Social Vulnerability Index of US Commuting Zones (2020)",
    caption = "Data source: Census ACS and https://sites.psu.edu/psucz/",
    fill = "Overall SVI"
    ) +
  theme_minimal()+
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.caption = element_text(size = 8),
    axis.text = element_blank(),
    panel.grid = element_blank(),
    legend.key.size = unit(0.5, "cm"),
    legend.position = "inside",
    legend.position.inside = c(0.9, 0.3),
    legend.frame = element_rect(color = "white", linewidth = 1),
    #legend.background = element_rect(color = "black"),
    legend.box.background = element_rect(color = "black"),
    legend.box.margin = margin(0,1,3,0, unit = "mm")
  )

#Alaska
inset_ak <- ggplot() +
  geom_sf(data = svi_geo, aes(fill = RPL_themes)) +
  coord_sf(xlim = c(-179.5,-130), ylim = c(51.2, 71.5)) +
  scale_fill_viridis_c(option = "inferno", direction = -1) +
  theme_minimal()+
  theme(
    axis.text = element_blank(),
    panel.grid = element_blank(),
    legend.position = "none"
  )

#Hawaii
inset_hi <- ggplot()+ 
  geom_sf(data = svi_geo, aes(fill = RPL_themes))+
  coord_sf(xlim = c(-178.3, -154.8), ylim = c(18.9, 28.5)) +  
  scale_fill_viridis_c(option = "inferno", direction = -1) +
  theme_minimal()+
  theme(
    axis.text = element_blank(),
    panel.grid = element_blank(),
    legend.position = "none"
  )

ggdraw(continent) + 
  draw_plot(inset_ak, -0.3, -0.3, scale = 0.3) +
  draw_plot(inset_hi, -0.2, -0.3, scale = 0.3)

In the following sections, we’ll discuss more details in function updates and validation.

Retrieve census data for aggregation

First proposed by Tolbert and Killian in 1987 and updated by Fowler, Jensen and Rhubart in 2016, commuting zones (CZs) offer a geographic delineation to understand regional labor market, taking into account the commuter flow data and Ameican Community Survey (ACS) data. Geographically, CZs are based on counties and cover the entire US. Therefore, to calculate SVI for CZs, we can retrieve the data at the county level and combined the counties to CZs level.

Additional argument exp

Currently, the variable retrieval process by get_census_data() follows the CDC/ATSDR SVI documentation strictly, where SVI variables starting with “EP_” (percent estimate) are sometimes obtained directly from Census, while in other cases are values computed from Census variables. For example, in the variable table below for 2020, the calculation field (the last column) for EP_UNEMP is a Census variable and a calculation formula for EP_POV150. As a result, DP03_0009PE from Census is directly used as EP_UNEMP, whereas S1701_C01_001E is obtained from Census and used to calculate EP_POV150 (along with E_POV150).

Code
variable_e_ep_calculation_2020 %>%
  filter(
    !theme == 5, 
    str_starts(x2020_variable_name, "EP_")
    ) %>% 
  reactable(
    defaultPageSize = 16,
    resizable = TRUE,
    bordered = TRUE,
    wrap = FALSE
    )

Table 1: ‘EP_’ variables in variable_e_ep_calculation_2020

However, for aggregation purposes, percent estimates cannot be summed up directly to a larger geographic level. Instead, we need to use summed “E_” SVI variables (count estimate) and the corresponding “total” count to calculate the CZ-level percent. Therefore, we need to modify variables like DP03_0009PE for EP_UNEMP so that the calculation field for all “EP_” variables are formula with explicitly defined denominator (total counts). For example:

Code
variable_cal_exp_2020 %>% 
  filter(
    !theme == 5, 
    str_starts(x2020_variable_name, "EP_")
    ) %>% 
  reactable(
    defaultPageSize = 16,
    resizable = TRUE,
    bordered = TRUE,
    wrap = FALSE
    )

Table 2: ‘EP_’ variables in variable_cal_exp_2020

To integrate the modified variable lists to get_census_data(), we introduce another argument exp, where we can specify which of the two variables list is used:

  • (Default) exp = FALSE: use variables following CDC/ATSDR documentation.

  • exp = TRUE: use variables with explicitly defined denominator.

For use with get_svi_x(), exp = TRUE is recommended.

available years

Currently, exp = TRUE and get_svi_x()only work with data for 2020.

Geometry option

In the example at the beginning of the post, we retrieved the census data for all US counties in 2020 with simple feature geometries for spatial analysis. Below shows the first 10 rows of the retrieved data, with columns being the Census variable names (and last column being the simple features).

Table 3: Census data with geometries for all US counties in 2020

To retrieve data without spatial information, we can leave out the geometry argument (geometry = FALSE is default) during get_census_data() to keep only the variables.

Code
data <- get_census_data(
  year = 2020,
  geography = "county",
  exp = TRUE
)

Validate with a pseudo-crosswalk

To validate the modified variable list and the calculation table, we’ll generate a pseudo-crosswalk with replicating the county IDs (county-county crosswalk, first 10 rows shown below):

Code
ps_xwalk <- data %>% 
  select(GEOID, NAME) %>% 
  mutate(GEOID2 = GEOID)

ps_xwalk %>% head(10)
GEOID NAME GEOID2
01001 Autauga County, Alabama 01001
01003 Baldwin County, Alabama 01003
01005 Barbour County, Alabama 01005
01007 Bibb County, Alabama 01007
01009 Blount County, Alabama 01009
01011 Bullock County, Alabama 01011
01013 Butler County, Alabama 01013
01015 Calhoun County, Alabama 01015
01017 Chambers County, Alabama 01017
01019 Cherokee County, Alabama 01019

Table 4: County-county pseudo-crosswalk (first 10 rows)

In the crosswalk, GEOID is the Census geography, and GEOID2 is the customized geography. Supplying this crosswalk to get_svi_x(), we’ll calculate the county-level SVI for US and compare the result with CDC/ATSDR SVI database using a scatter plot.

Code
svi_test <- get_svi_x(
  year = 2020,
  data = data,
  xwalk = ps_xwalk
)

cdc_cty_svi2020 <- read_csv("datasets/SVI_2020_US_county.csv")

join_RPL <- cdc_cty_svi2020 %>%
    select(GEOID = FIPS,
      cdc_RPL_themes = RPL_THEMES,
      cdc_RPL_theme1 = RPL_THEME1,
      cdc_RPL_theme2 = RPL_THEME2,
      cdc_RPL_theme3 = RPL_THEME3,
      cdc_RPL_theme4 = RPL_THEME4) %>%
    mutate(GEOID = paste(GEOID)) %>%
    left_join(svi_test %>%
        select(GEOID,
          RPL_themes,
          RPL_theme1,
          RPL_theme2,
          RPL_theme3,
          RPL_theme4)) %>%
    drop_na() %>%   ## remove NA rows
    filter_all(all_vars(. >= 0)) #-999 in cdc data

coeff1 <- cor(join_RPL$cdc_RPL_themes, join_RPL$RPL_themes)

join_RPL %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_abline(slope = 1, intercept = 0)+
  labs(title = "CDC vs. findSVI CTY-level SVI for US in 2020",
    subtitle = paste0("Comparison of overall percentile ranking (RPLs), correlation coefficient = ", coeff1),
    y = "findSVI",
    x = "CDC")+
  theme(plot.title = element_text(size= 15))

With strong correlation between the our calculation and CDC/ATSDR SVI result, we’ll continue with the “real” crosswalk to calculate the CZ-level SVI.

(If the correlation holds up for the other year, it may be worth considering to replace the current variable lists and calculation tables with the explicit-denominator version, so that the exp argument wouldn’t be necessary.)

County-commuting zone crosswalk

get_svi_x() relies on a user-defined crosswalk (relationship file) between the Census geography and customized geography. The crosswalk between CZ and county for 2020 is downloaded from the Penn State Commuting Zones / Labor Markets data repository, and modified to keep only the IDs for counties and CZs. First 10 rows of the crosswalk look like this:

Code
cty_cz_2020_xwalk %>% 
  head(10)
GEOID GEOID2
01069 3
01023 9
01005 3
01107 4
01033 10
04012 37
04001 32
05081 55
05121 46
06037 37

Table 5: County-CZ crosswalk cty_cz_2020_xwalk (first 10 rows)

The full table cty_cz_2020_xwalk is stored in the package as an example and a template for the crosswalk. Note that the crosswalk should be a data frame, with column names GEOID representing the Census geography and GEOID2 representing the user-defined geography. GEOID should be completed nested in GEOID2 so that the census data can be accurately aggregated to the customized geographic level. For example, counties are nested in CZs, and the number of counties in each CZ for the first 10 CZs is shown below:

Code
cty_cz_2020_xwalk %>% 
  count(GEOID2) %>% 
  rename(number_of_counties = n) %>% 
  head(10)
GEOID2 number_of_counties
1 8
2 5
3 11
4 9
5 8
6 10
7 13
8 5
9 8
10 8

Table 6: Number of counties in CZ 1-10

Optionally, the crosswalk can include another column NAME for the description or name of the user-defined geography, which will be included in the final SVI output.

Aggregate and calculate SVI

The goal of the new function get_svi_x() is to extend the functionality of get_svi() for customized geographic levels. Inputs for get_svi_x() are overall consistent with get_svi(), except for an additional crosswalk specifying the relationship between the Census geography and the customized geography.

Without geometry

Using the county-level census data and the county-CZ crosswalk, get_svi_x() sums up the count variables from the county level to the CZ level, and calculates the percent estimates, theme-specific and overall SVI for CZs. Below we are showing the first 50 rows of the result (column names follow the CDC SVI documentation 2020):

Code
svi <- get_svi_x(
  year = 2020,
  data = data, 
  xwalk = cty_cz_2020_xwalk
)

svi %>% 
  head(50) %>% 
  reactable(
  columns = list(
    GEOID = colDef(
      sticky = "left",
      style = sticky_style,
      headerStyle = sticky_style
    )
  ),
#  theme = reactableTheme(color = "#002b36"),
  defaultPageSize = 10,
  resizable = TRUE,
  bordered = TRUE,
  wrap = FALSE
)

Table 7: CZ-level SVI results for 2020

With geometry

As mentioned before, for spatial analysis, we can choose to include the geometries during census data retrieval (data_geo), and supply that to get_svi_x() with the crosswalk. Geometries from the census data will be merged to the customized level according to the crosswalk, along with the aggregation of the attributes. (Depending on the crosswalk and geometries, this process will likely take longer.)