Code
::install_github("heli-xu/findSVI@customized-boundaries") devtools
An example using US commuting zones for 2020
Heli Xu
April 29, 2024
Here we are using commuting zones as an example to demonstrate the new feature in findSVI r package to support user-defined boundaries for SVI calculation.
The current CRAN version of findSVI package supports all Census geographies, and census data retrieval and SVI calculation are performed at the same geographic level. For users that would like to calculate SVI at additional geographic levels in which the Census geographies are fully nested, we are incorporating a new function get_svi_x()
to calculate SVI for customized boundaries.
First we are installing the package via github customized-boundaries
branch and loading the packages needed.
The general workflow of SVI calculation at a customized geographic level:
Input to supply: a year and a customized geogrpahic level of interest, and a crosswalk between a Census geographic level and the customized level.
Retrieve ACS data at a Census geographic level using get_census_data()
with exp = TRUE
.
Supply the data and crosswalk in get_svi_x()
to aggregate data from the Census geographic level to the customized level and calculate SVI.
For example, to calculate SVI with geometries for US commuting zones (that are consisted of counties) for 2020, we can use the following:
To visualize the overall SVI for each CZ in the US (the higher the SVI, the more vulnerable a community is considered):
continent <- ggplot()+
geom_sf(data = svi_geo, aes(fill = RPL_themes))+
coord_sf(xlim = c(-130, -60), ylim = c(23, 50)) +
# Adjust xlim and ylim to focus on continental US
scale_fill_viridis_c(option = "inferno", direction = -1) +
labs(
title = "Social Vulnerability Index of US Commuting Zones (2020)",
caption = "Data source: Census ACS and https://sites.psu.edu/psucz/",
fill = "Overall SVI"
) +
theme_minimal()+
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
plot.caption = element_text(size = 8),
axis.text = element_blank(),
panel.grid = element_blank(),
legend.key.size = unit(0.5, "cm"),
legend.position = "inside",
legend.position.inside = c(0.9, 0.3),
legend.frame = element_rect(color = "white", linewidth = 1),
#legend.background = element_rect(color = "black"),
legend.box.background = element_rect(color = "black"),
legend.box.margin = margin(0,1,3,0, unit = "mm")
)
#Alaska
inset_ak <- ggplot() +
geom_sf(data = svi_geo, aes(fill = RPL_themes)) +
coord_sf(xlim = c(-179.5,-130), ylim = c(51.2, 71.5)) +
scale_fill_viridis_c(option = "inferno", direction = -1) +
theme_minimal()+
theme(
axis.text = element_blank(),
panel.grid = element_blank(),
legend.position = "none"
)
#Hawaii
inset_hi <- ggplot()+
geom_sf(data = svi_geo, aes(fill = RPL_themes))+
coord_sf(xlim = c(-178.3, -154.8), ylim = c(18.9, 28.5)) +
scale_fill_viridis_c(option = "inferno", direction = -1) +
theme_minimal()+
theme(
axis.text = element_blank(),
panel.grid = element_blank(),
legend.position = "none"
)
ggdraw(continent) +
draw_plot(inset_ak, -0.3, -0.3, scale = 0.3) +
draw_plot(inset_hi, -0.2, -0.3, scale = 0.3)
In the following sections, we’ll discuss more details in function updates and validation.
First proposed by Tolbert and Killian in 1987 and updated by Fowler, Jensen and Rhubart in 2016, commuting zones (CZs) offer a geographic delineation to understand regional labor market, taking into account the commuter flow data and Ameican Community Survey (ACS) data. Geographically, CZs are based on counties and cover the entire US. Therefore, to calculate SVI for CZs, we can retrieve the data at the county level and combined the counties to CZs level.
exp
Currently, the variable retrieval process by get_census_data()
follows the CDC/ATSDR SVI documentation strictly, where SVI variables starting with “EP_” (percent estimate) are sometimes obtained directly from Census, while in other cases are values computed from Census variables. For example, in the variable table below for 2020, the calculation field (the last column) for EP_UNEMP
is a Census variable and a calculation formula for EP_POV150
. As a result, DP03_0009PE
from Census is directly used as EP_UNEMP
, whereas S1701_C01_001E
is obtained from Census and used to calculate EP_POV150
(along with E_POV150
).
Table 1: ‘EP_’ variables in variable_e_ep_calculation_2020
However, for aggregation purposes, percent estimates cannot be summed up directly to a larger geographic level. Instead, we need to use summed “E_” SVI variables (count estimate) and the corresponding “total” count to calculate the CZ-level percent. Therefore, we need to modify variables like DP03_0009PE
for EP_UNEMP
so that the calculation field for all “EP_” variables are formula with explicitly defined denominator (total counts). For example:
Table 2: ‘EP_’ variables in variable_cal_exp_2020
To integrate the modified variable lists to get_census_data()
, we introduce another argument exp
, where we can specify which of the two variables list is used:
(Default) exp = FALSE
: use variables following CDC/ATSDR documentation.
exp = TRUE
: use variables with explicitly defined denominator.
For use with get_svi_x()
, exp = TRUE
is recommended.
Currently, exp = TRUE
and get_svi_x()
only work with data for 2020.
In the example at the beginning of the post, we retrieved the census data for all US counties in 2020 with simple feature geometries for spatial analysis. Below shows the first 10 rows of the retrieved data, with columns being the Census variable names (and last column being the simple features).
Table 3: Census data with geometries for all US counties in 2020
To retrieve data without spatial information, we can leave out the geometry
argument (geometry = FALSE
is default) during get_census_data()
to keep only the variables.
To validate the modified variable list and the calculation table, we’ll generate a pseudo-crosswalk with replicating the county IDs (county-county crosswalk, first 10 rows shown below):
GEOID | NAME | GEOID2 |
---|---|---|
01001 | Autauga County, Alabama | 01001 |
01003 | Baldwin County, Alabama | 01003 |
01005 | Barbour County, Alabama | 01005 |
01007 | Bibb County, Alabama | 01007 |
01009 | Blount County, Alabama | 01009 |
01011 | Bullock County, Alabama | 01011 |
01013 | Butler County, Alabama | 01013 |
01015 | Calhoun County, Alabama | 01015 |
01017 | Chambers County, Alabama | 01017 |
01019 | Cherokee County, Alabama | 01019 |
Table 4: County-county pseudo-crosswalk (first 10 rows)
In the crosswalk, GEOID
is the Census geography, and GEOID2
is the customized geography. Supplying this crosswalk to get_svi_x()
, we’ll calculate the county-level SVI for US and compare the result with CDC/ATSDR SVI database using a scatter plot.
svi_test <- get_svi_x(
year = 2020,
data = data,
xwalk = ps_xwalk
)
cdc_cty_svi2020 <- read_csv("datasets/SVI_2020_US_county.csv")
join_RPL <- cdc_cty_svi2020 %>%
select(GEOID = FIPS,
cdc_RPL_themes = RPL_THEMES,
cdc_RPL_theme1 = RPL_THEME1,
cdc_RPL_theme2 = RPL_THEME2,
cdc_RPL_theme3 = RPL_THEME3,
cdc_RPL_theme4 = RPL_THEME4) %>%
mutate(GEOID = paste(GEOID)) %>%
left_join(svi_test %>%
select(GEOID,
RPL_themes,
RPL_theme1,
RPL_theme2,
RPL_theme3,
RPL_theme4)) %>%
drop_na() %>% ## remove NA rows
filter_all(all_vars(. >= 0)) #-999 in cdc data
coeff1 <- cor(join_RPL$cdc_RPL_themes, join_RPL$RPL_themes)
join_RPL %>%
ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
geom_point(color = "#004C54")+
geom_abline(slope = 1, intercept = 0)+
labs(title = "CDC vs. findSVI CTY-level SVI for US in 2020",
subtitle = paste0("Comparison of overall percentile ranking (RPLs), correlation coefficient = ", coeff1),
y = "findSVI",
x = "CDC")+
theme(plot.title = element_text(size= 15))
With strong correlation between the our calculation and CDC/ATSDR SVI result, we’ll continue with the “real” crosswalk to calculate the CZ-level SVI.
(If the correlation holds up for the other year, it may be worth considering to replace the current variable lists and calculation tables with the explicit-denominator version, so that the exp
argument wouldn’t be necessary.)
get_svi_x()
relies on a user-defined crosswalk (relationship file) between the Census geography and customized geography. The crosswalk between CZ and county for 2020 is downloaded from the Penn State Commuting Zones / Labor Markets data repository, and modified to keep only the IDs for counties and CZs. First 10 rows of the crosswalk look like this:
GEOID | GEOID2 |
---|---|
01069 | 3 |
01023 | 9 |
01005 | 3 |
01107 | 4 |
01033 | 10 |
04012 | 37 |
04001 | 32 |
05081 | 55 |
05121 | 46 |
06037 | 37 |
Table 5: County-CZ crosswalk cty_cz_2020_xwalk
(first 10 rows)
The full table cty_cz_2020_xwalk
is stored in the package as an example and a template for the crosswalk. Note that the crosswalk should be a data frame, with column names GEOID
representing the Census geography and GEOID2
representing the user-defined geography. GEOID
should be completed nested in GEOID2
so that the census data can be accurately aggregated to the customized geographic level. For example, counties are nested in CZs, and the number of counties in each CZ for the first 10 CZs is shown below:
GEOID2 | number_of_counties |
---|---|
1 | 8 |
2 | 5 |
3 | 11 |
4 | 9 |
5 | 8 |
6 | 10 |
7 | 13 |
8 | 5 |
9 | 8 |
10 | 8 |
Table 6: Number of counties in CZ 1-10
Optionally, the crosswalk can include another column NAME
for the description or name of the user-defined geography, which will be included in the final SVI output.
The goal of the new function get_svi_x()
is to extend the functionality of get_svi()
for customized geographic levels. Inputs for get_svi_x()
are overall consistent with get_svi()
, except for an additional crosswalk specifying the relationship between the Census geography and the customized geography.
Using the county-level census data and the county-CZ crosswalk, get_svi_x()
sums up the count variables from the county level to the CZ level, and calculates the percent estimates, theme-specific and overall SVI for CZs. Below we are showing the first 50 rows of the result (column names follow the CDC SVI documentation 2020):
svi <- get_svi_x(
year = 2020,
data = data,
xwalk = cty_cz_2020_xwalk
)
svi %>%
head(50) %>%
reactable(
columns = list(
GEOID = colDef(
sticky = "left",
style = sticky_style,
headerStyle = sticky_style
)
),
# theme = reactableTheme(color = "#002b36"),
defaultPageSize = 10,
resizable = TRUE,
bordered = TRUE,
wrap = FALSE
)
Table 7: CZ-level SVI results for 2020
As mentioned before, for spatial analysis, we can choose to include the geometries during census data retrieval (data_geo
), and supply that to get_svi_x()
with the crosswalk. Geometries from the census data will be merged to the customized level according to the crosswalk, along with the aggregation of the attributes. (Depending on the crosswalk and geometries, this process will likely take longer.)