with aggregating CDC CT-level SVI to ZCTA level for PA in 2018 and 2020
Author
Heli Xu
Published
February 8, 2023
Currently, our SVI calculation logic is stored in R functions , and we’re in the process of package development. In responding to the request for PA SVI at the ZCTA level from 2017-2021, we’d like to include this report to provide the rationale and approaches used in validating the results. Any suggestions and feedback are greatly appreciated.
This report will cover the following sections:
SVI calculation and validation: Brief introduction to SVI calculation and variables preparation which are validated by comparing our county-/tract-level results to CDC’s SVI (example: PA, tract level, 2018);
PA ZCTA-level SVI 2018: Summary of the approaches for aggregating CDC tract-level data to ZCTA level and comparing it with our PA ZCTA-level result for 2018;
PA ZCTA-level SVI 2020: Comparing aggregated CDC ZCTA-level SVI with our PA ZCTA-level result for 2020.
SVI calculation and validation
As included in a separate R script (“function collection.R”), get_census_data()(using {tidycensus} under the hood) and get_svi() were used to calculate SVI from census data. The variables required for SVI calculation were either extracted from the dictionary published by CDC when SVI was released (for 2014, 2016, 2018, 2020), or modified from the adjacent years to account for minor changes (for 2015, 2017, 2019, 2021). Details about the variables preparation are included in a separate script (“svi variable prep.R”).
As part of the validation process for our R functions, county and census tract level SVI for certain states are calculated and compared with CDC-released SVI for the same year (if available) or adjacent year, making sure the two versions of SVIs by GEOID are well correlated. For example, our calculated SVI for PA in 2018 at census tract level is highly consistent with the CDC data for 2018, with a correlation coefficient above 0.99 for overall and theme-specific SVI.
With SVI calculated at the ZCTA level in PA from 2017-2021, here we’d like to check the results against the CDC-published SVI data for PA in the years that they are available (2018 and 2020). Since SVI data released by CDC are either at county or census tract level for each state, we need to add ZCTA information to the SVI tables and aggregate the data from census tract level to ZCTA level. ZCTA to census tract (ct) crosswalk were modified from the relationship files (decennial) from census.gov.
Add ZCTA information to CDC-released SVI
Using ZCTA to census tract crosswalk for 2010, we’re matching each census tract in the 2018 CDC SVI table with a corresponding ZCTA (first few tracts shown below):
We could briefly check if there’s any GEOID (ct) that didn’t get matched with a ZCTA code (none):
Code
zsvi_pa2018 %>%filter(is.na(ZCTA))
ST
STATE
ST_ABBR
STCNTY
COUNTY
GEOID
ZCTA
LOCATION
AREA_SQMI
E_TOTPOP
M_TOTPOP
E_HU
M_HU
E_HH
M_HH
E_POV
M_POV
E_UNEMP
M_UNEMP
E_PCI
M_PCI
E_NOHSDP
M_NOHSDP
E_AGE65
M_AGE65
E_AGE17
M_AGE17
E_DISABL
M_DISABL
E_SNGPNT
M_SNGPNT
E_MINRTY
M_MINRTY
E_LIMENG
M_LIMENG
E_MUNIT
M_MUNIT
E_MOBILE
M_MOBILE
E_CROWD
M_CROWD
E_NOVEH
M_NOVEH
E_GROUPQ
M_GROUPQ
EP_POV
MP_POV
EP_UNEMP
MP_UNEMP
EP_PCI
MP_PCI
EP_NOHSDP
MP_NOHSDP
EP_AGE65
MP_AGE65
EP_AGE17
MP_AGE17
EP_DISABL
MP_DISABL
EP_SNGPNT
MP_SNGPNT
EP_MINRTY
MP_MINRTY
EP_LIMENG
MP_LIMENG
EP_MUNIT
MP_MUNIT
EP_MOBILE
MP_MOBILE
EP_CROWD
MP_CROWD
EP_NOVEH
MP_NOVEH
EP_GROUPQ
MP_GROUPQ
EPL_POV
EPL_UNEMP
EPL_PCI
EPL_NOHSDP
SPL_THEME1
RPL_THEME1
EPL_AGE65
EPL_AGE17
EPL_DISABL
EPL_SNGPNT
SPL_THEME2
RPL_THEME2
EPL_MINRTY
EPL_LIMENG
SPL_THEME3
RPL_THEME3
EPL_MUNIT
EPL_MOBILE
EPL_CROWD
EPL_NOVEH
EPL_GROUPQ
SPL_THEME4
RPL_THEME4
SPL_THEMES
RPL_THEMES
F_POV
F_UNEMP
F_PCI
F_NOHSDP
F_THEME1
F_AGE65
F_AGE17
F_DISABL
F_SNGPNT
F_THEME2
F_MINRTY
F_LIMENG
F_THEME3
F_MUNIT
F_MOBILE
F_CROWD
F_NOVEH
F_GROUPQ
F_THEME4
F_TOTAL
E_UNINSUR
M_UNINSUR
EP_UNINSUR
MP_UNINSUR
E_DAYPOP
Aggregating ct data to ZCTA level
Generally speaking, ZCTAs represent larger areas than census tracts, but census tracts do not necessarily nest within ZCTAs – sometimes one census tract could correspond to multiple ZCTAs (while one or more of those ZCTAs also show up in other census tracts). In such cases, we might be aggregating values of a larger area to represent a smaller area. Considering the complicated relationship between ZCTAs and census tracts, here we’re subsetting the ct-specific GEOIDs with only one matching ZCTA (different ct can be matched to the same ZCTA).
The potential caveat is that we might be over-representing some census tracts for a ZCTA when it’s supposed to include multiple census tracts, but some of the census tracts are excluded because they’re also matched to other ZCTAs. An example will be provided later in this document.
Variables (“E_xx” and “EP_xx”)
For individual variables, we’re taking sum for the “E_xx” values by the ZCTA as they mostly represent counts, whereas for “EP_xx” values, we’re taking the mean as they represent a percentage of the total. Before 2018, the exception is “E_PCI”, which is the same value as “EP_PCI”, representing per capita income, and we’re taking the mean of that. First 15 rows of the aggregated table is shown below:
Code
#select ct:zcta = 1ct_zcta_r1_18 <- zsvi_pa2018 %>%group_by(GEOID) %>%count() %>%arrange(n) %>%filter(n==1) %>%pull(GEOID)#aggregate by types of variablesvar_table18 <-readRDS("../../../data/variable_e_ep_calculation_2018.rds")var_e18 <- var_table18 %>%filter(theme%in%c(0:4),str_detect(.[[1]], "E_")) %>%pull(1)var_ep18 <- var_table18 %>%filter(theme%in%c(0:4),str_detect(.[[1]], "EP_")) %>%pull(1)cdc18 <- zsvi_pa2018 %>%select(GEOID, ZCTA, all_of(var_e18), all_of(var_ep18)) %>%filter(GEOID%in%all_of(ct_zcta_r1_18)) %>%pivot_longer(-c(GEOID,ZCTA), names_to ="var_name", values_to ="value") %>%filter(value >=0) %>%# to remove -999 as NA in the tablegroup_by(ZCTA, var_name) %>%summarise(sum =sum(value),mean =mean(value)) %>%mutate(var_zcta =case_when( var_name =="E_PCI"~ mean,str_starts(var_name, "E_") ~ sum,str_starts(var_name, "EP_") ~ mean )) %>%ungroup() %>%select(-sum, -mean)cdc18 %>%head(15) %>%kable()
ZCTA
var_name
var_zcta
15001
EP_AGE17
17.716667
15001
EP_AGE65
20.466667
15001
EP_CROWD
0.850000
15001
EP_DISABL
20.466667
15001
EP_GROUPQ
2.416667
15001
EP_LIMENG
0.600000
15001
EP_MINRTY
32.383333
15001
EP_MOBILE
2.250000
15001
EP_MUNIT
6.100000
15001
EP_NOHSDP
7.816667
15001
EP_NOVEH
14.050000
15001
EP_PCI
24178.666667
15001
EP_POV
23.600000
15001
EP_SNGPNT
10.800000
15001
EP_UNEMP
7.500000
Percentile ranking (“RPL_xx”) by theme
For percentile ranking, for now we’re taking the mean by the ZCTA and we’ll get a table with first few rows looking like this: (I tried weighted by population and it didn’t seem to help too much with correlation.)
Correlation: aggregated CDC data vs. calculated result
To compare the aggregated CDC SVI data with our calculated SVI, we’re joining the CDC data with our result by each ZCTA, and we could check the correlation between the two versions of SVI in all ZCTAs.
For individual variables, below shows the correlation coefficient for each variable:
Code
result2018 <-readRDS("../../../cdc_us_svi/result/pa_zcta_result2018.rds") result18 <- result2018 %>%select(ZCTA = GEOID, all_of(var_e18), all_of(var_ep18)) %>%pivot_longer(-ZCTA, names_to ="var_name", values_to ="value_hx") %>%filter(value_hx >=0)cdc_result18 <- cdc18 %>%left_join(result18, by=c("ZCTA", "var_name")) %>%drop_na() %>%group_by(var_name) %>%mutate(cor =cor(var_zcta, value_hx)) %>%ungroup()p18_e <- cdc_result18 %>%select(var_name, cor) %>%filter(var_name%in%all_of(var_e18)) %>%distinct() %>%ggplot(aes(x=cor, y =reorder(var_name, cor)))+geom_col(fill ="#004C54")+xlim(0,1)+labs(y ="variable name",x ="corr. coeff.")p18_ep <- cdc_result18 %>%select(var_name, cor) %>%filter(var_name%in%all_of(var_ep18)) %>%distinct() %>%ggplot(aes(x=cor, y =reorder(var_name, cor)))+geom_col(fill ="#49592a")+xlim(0,1)+labs(y ="variable name",x ="corr. coeff.")p18_e+p18_ep+plot_annotation(title ="Correlation coeff. for each variable in SVI calculation",subtitle ="between calculated & CDC SVI in 2018 (zcta level)")&theme(plot.title =element_text(size =14))
For RPLs, we’ll compare RPL for each theme and overall RPL for all themes. Here is a plot for the overall RPL for all themes of each ZCTA:
In the plot above, we notice a data point that is close to 1 in calculated result, but less than 0.25 in CDC data. That point corresponds to ZCTA 15207. In CDC data, this ZCTA has the following census tracts:
But among these census tracts, almost all of them match to other ZCTAs, and the ones end up included in the aggregation are the following two (and the -999 gets further excluded):
That’s why the “aggregated” SVI for ZCTA15207 in fact only represents one of the census tracts in the area, leading to a discrepancy with our calculated result.
PA ZCTA-level SVI 2020
Similarly, we could aggregate and compare CDC-released SVI for 2020 at the census tract level with our calculated result at the ZCTA level.
First, we’re joining the CDC SVI table to a new ZCTA to census tract crosswalk updated in 2020. Here is a glance at the table:
Secondly, we’ll aggregate the CDC data from ct level to ZCTA level, including the variables and percentile rankings by theme (first 15 rows of each table is shown below).
Aggregated data for all variables:
Code
#select ct:zcta = 1ct_zcta_r1_20 <- zsvi_pa2020 %>%group_by(GEOID) %>%count() %>%arrange(n) %>%filter(n==1) %>%pull(GEOID)#aggregate by types of variablesvar_table20 <-readRDS("../../../data/variable_e_ep_calculation_2020.rds")var_e20 <- var_table20 %>%filter(theme%in%c(0:4),str_detect(.[[1]], "E_")) %>%pull(1)var_ep20 <- var_table20 %>%filter(theme%in%c(0:4),str_detect(.[[1]], "EP_")) %>%pull(1)cdc20 <- zsvi_pa2020 %>%select(GEOID, ZCTA, all_of(var_e20), all_of(var_ep20)) %>%filter(GEOID%in%all_of(ct_zcta_r1_20)) %>%pivot_longer(-c(GEOID,ZCTA), names_to ="var_name", values_to ="value") %>%filter(value >=0) %>%# to remove -999 as NA in the tablegroup_by(ZCTA, var_name) %>%summarise(sum =sum(value),mean =mean(value)) %>%mutate(var_zcta =case_when( var_name =="E_PCI"~ mean,str_starts(var_name, "E_") ~ sum,str_starts(var_name, "EP_") ~ mean )) %>%ungroup() %>%select(-sum, -mean)cdc20 %>%head(15) %>%kable()
After adding ZCTA information and aggregating the CDC data, we’re ready to compare them to our calculated result for 2020. Most of the variables turn out quite consistent between our calculation and aggregated CDC data, despite a few peculiar outliers (shown in the correlation plot), possibly due to over-representation of some census tracts.
Code
result2020 <-readRDS("../../../cdc_us_svi/result/pa_zcta_result2020.rds")result20 <- result2020 %>%select(ZCTA = GEOID, all_of(var_e20), all_of(var_ep20)) %>%pivot_longer(-ZCTA, names_to ="var_name", values_to ="value_hx") %>%filter(value_hx >=0)cdc_result20 <- cdc20 %>%left_join(result20, by=c("ZCTA", "var_name")) %>%drop_na() %>%group_by(var_name) %>%mutate(cor =cor(var_zcta, value_hx)) %>%ungroup()p20_e <- cdc_result20 %>%select(var_name, cor) %>%filter(var_name%in%all_of(var_e20)) %>%distinct() %>%ggplot(aes(x=cor, y =reorder(var_name, cor)))+geom_col(fill ="#004C54")+xlim(0,1)+labs(y ="variable name",x ="corr. coeff.")p20_ep <- cdc_result20 %>%select(var_name, cor) %>%filter(var_name%in%all_of(var_ep20)) %>%distinct() %>%ggplot(aes(x=cor, y =reorder(var_name, cor)))+geom_col(fill ="#49592a")+xlim(0,1)+labs(y ="variable name",x ="corr. coeff.")p20_e+p20_ep+plot_annotation(title ="Correlation coeff. for each variable in SVI calculation",subtitle ="between calculated & CDC SVI in 2020 (zcta level)")&theme(plot.title =element_text(size =15))