SVI Calculation - SVI Calculation Validation (ZCTA level)

Currently, our SVI calculation logic is stored in R functions , and we’re in the process of package development. In responding to the request for PA SVI at the ZCTA level from 2017-2021, we’d like to include this report to provide the rationale and approaches used in validating the results. Any suggestions and feedback are greatly appreciated.

This report will cover the following sections:

SVI calculation and validation: Brief introduction to SVI calculation and variables preparation which are validated by comparing our county-/tract-level results to CDC’s SVI (example: PA, tract level, 2018);
PA ZCTA-level SVI 2018: Summary of the approaches for aggregating CDC tract-level data to ZCTA level and comparing it with our PA ZCTA-level result for 2018;
PA ZCTA-level SVI 2020: Comparing aggregated CDC ZCTA-level SVI with our PA ZCTA-level result for 2020.

SVI calculation and validation

As included in a separate R script (“function collection.R”), get_census_data()(using {tidycensus} under the hood) and get_svi() were used to calculate SVI from census data. The variables required for SVI calculation were either extracted from the dictionary published by CDC when SVI was released (for 2014, 2016, 2018, 2020), or modified from the adjacent years to account for minor changes (for 2015, 2017, 2019, 2021). Details about the variables preparation are included in a separate script (“svi variable prep.R”).

As part of the validation process for our R functions, county and census tract level SVI for certain states are calculated and compared with CDC-released SVI for the same year (if available) or adjacent year, making sure the two versions of SVIs by GEOID are well correlated. For example, our calculated SVI for PA in 2018 at census tract level is highly consistent with the CDC data for 2018, with a correlation coefficient above 0.99 for overall and theme-specific SVI.

Code

library(tidyverse)
library(patchwork)
library(knitr)


result_ct_pa2018 <- readRDS("../../../cdc_us_svi/result/pa_tract_result2018.rds")

svi_pa_2018 <- read_csv("../../../cdc_us_svi/cdc_svi_2018_pa_ct.csv") %>% 
  rename(GEOID = FIPS)

ct_check <- svi_pa_2018 %>% 
  select(
    GEOID, 
    cdc_RPL_themes = RPL_THEMES, 
    cdc_RPL_theme1 = RPL_THEME1,
    cdc_RPL_theme2 = RPL_THEME2,
    cdc_RPL_theme3 = RPL_THEME3,
    cdc_RPL_theme4 = RPL_THEME4
    ) %>%
  mutate(GEOID = paste(GEOID)) %>% 
  left_join(
    result_ct_pa2018 %>% 
      select(
        GEOID, 
        RPL_themes,
        RPL_theme1,
        RPL_theme2,
        RPL_theme3, 
        RPL_theme4
        )
  ) 

# ct_check %>% 
#   filter(is.na(RPL_theme1)) #%>% kable()

ct_corr1 <- ct_check %>% 
  drop_na() %>%   ## remove NA rows
  filter_all(all_vars(.>=0)) %>%
  transmute(overall = cor(cdc_RPL_themes, RPL_themes),
    theme1 = cor(cdc_RPL_theme1, RPL_theme1),
    theme2 = cor(cdc_RPL_theme2, RPL_theme2),
    theme3 = cor(cdc_RPL_theme3, RPL_theme3),
    theme4 = cor(cdc_RPL_theme4, RPL_theme4)) %>% 
  distinct() %>% 
 pivot_longer(1:5, names_to = "theme", values_to = "value") %>% 
  ggplot()+
  geom_col(aes(x=theme, y = value), fill= "#004C54")+
  labs(y = "Corr. coeff.")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ct_corr2 <- ct_check %>% 
  drop_na() %>% 
  filter_all(all_vars(.>=0)) %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_abline(slope = 1, intercept = 0)+
  labs(y = "calculated overall RPL",
    x = "CDC overall RPL")

ct_corr1+ct_corr2+
  plot_annotation(title = "Correlation coeff. for percentile rankings (RPLs)",
    subtitle = "between calculated & CDC SVI in 2018 (ct level)")&
  theme(plot.title = element_text(size = 14))

PA ZCTA-level SVI 2018

With SVI calculated at the ZCTA level in PA from 2017-2021, here we’d like to check the results against the CDC-published SVI data for PA in the years that they are available (2018 and 2020). Since SVI data released by CDC are either at county or census tract level for each state, we need to add ZCTA information to the SVI tables and aggregate the data from census tract level to ZCTA level. ZCTA to census tract (ct) crosswalk were modified from the relationship files (decennial) from census.gov.

Add ZCTA information to CDC-released SVI

Using ZCTA to census tract crosswalk for 2010, we’re matching each census tract in the 2018 CDC SVI table with a corresponding ZCTA (first few tracts shown below):

Code

ct_zcta_xwalk2010 <- readRDS("../../../data/ct_zcta_xwalk2010.rds")



zsvi_pa2018 <- svi_pa_2018 %>% 
  left_join(ct_zcta_xwalk2010, by = "GEOID") %>% 
  relocate(ZCTA, .after = GEOID)

zsvi_pa2018 %>% head() %>%  select(1:7) %>% kable()

ST	STATE	ST_ABBR	STCNTY	COUNTY	GEOID	ZCTA
42	PENNSYLVANIA	PA	42001	Adams	42001030101	17019
42	PENNSYLVANIA	PA	42001	Adams	42001030101	17316
42	PENNSYLVANIA	PA	42001	Adams	42001030101	17324
42	PENNSYLVANIA	PA	42001	Adams	42001030101	17372
42	PENNSYLVANIA	PA	42001	Adams	42001030102	17316
42	PENNSYLVANIA	PA	42001	Adams	42001030102	17350

We could briefly check if there’s any GEOID (ct) that didn’t get matched with a ZCTA code (none):

Code

zsvi_pa2018 %>% filter(is.na(ZCTA))

ST	STATE	ST_ABBR	STCNTY	COUNTY	GEOID	ZCTA	LOCATION	AREA_SQMI	E_TOTPOP	M_TOTPOP	E_HU	M_HU	E_HH	M_HH	E_POV	M_POV	E_UNEMP	M_UNEMP	E_PCI	M_PCI	E_NOHSDP	M_NOHSDP	E_AGE65	M_AGE65	E_AGE17	M_AGE17	E_DISABL	M_DISABL	E_SNGPNT	M_SNGPNT	E_MINRTY	M_MINRTY	E_LIMENG	M_LIMENG	E_MUNIT	M_MUNIT	E_MOBILE	M_MOBILE	E_CROWD	M_CROWD	E_NOVEH	M_NOVEH	E_GROUPQ	M_GROUPQ	EP_POV	MP_POV	EP_UNEMP	MP_UNEMP	EP_PCI	MP_PCI	EP_NOHSDP	MP_NOHSDP	EP_AGE65	MP_AGE65	EP_AGE17	MP_AGE17	EP_DISABL	MP_DISABL	EP_SNGPNT	MP_SNGPNT	EP_MINRTY	MP_MINRTY	EP_LIMENG	MP_LIMENG	EP_MUNIT	MP_MUNIT	EP_MOBILE	MP_MOBILE	EP_CROWD	MP_CROWD	EP_NOVEH	MP_NOVEH	EP_GROUPQ	MP_GROUPQ	EPL_POV	EPL_UNEMP	EPL_PCI	EPL_NOHSDP	SPL_THEME1	RPL_THEME1	EPL_AGE65	EPL_AGE17	EPL_DISABL	EPL_SNGPNT	SPL_THEME2	RPL_THEME2	EPL_MINRTY	EPL_LIMENG	SPL_THEME3	RPL_THEME3	EPL_MUNIT	EPL_MOBILE	EPL_CROWD	EPL_NOVEH	EPL_GROUPQ	SPL_THEME4	RPL_THEME4	SPL_THEMES	RPL_THEMES	F_POV	F_UNEMP	F_PCI	F_NOHSDP	F_THEME1	F_AGE65	F_AGE17	F_DISABL	F_SNGPNT	F_THEME2	F_MINRTY	F_LIMENG	F_THEME3	F_MUNIT	F_MOBILE	F_CROWD	F_NOVEH	F_GROUPQ	F_THEME4	F_TOTAL	E_UNINSUR	M_UNINSUR	EP_UNINSUR	MP_UNINSUR	E_DAYPOP

Aggregating ct data to ZCTA level

Generally speaking, ZCTAs represent larger areas than census tracts, but census tracts do not necessarily nest within ZCTAs – sometimes one census tract could correspond to multiple ZCTAs (while one or more of those ZCTAs also show up in other census tracts). In such cases, we might be aggregating values of a larger area to represent a smaller area. Considering the complicated relationship between ZCTAs and census tracts, here we’re subsetting the ct-specific GEOIDs with only one matching ZCTA (different ct can be matched to the same ZCTA).

The potential caveat is that we might be over-representing some census tracts for a ZCTA when it’s supposed to include multiple census tracts, but some of the census tracts are excluded because they’re also matched to other ZCTAs. An example will be provided later in this document.

Variables (“E_xx” and “EP_xx”)

For individual variables, we’re taking sum for the “E_xx” values by the ZCTA as they mostly represent counts, whereas for “EP_xx” values, we’re taking the mean as they represent a percentage of the total. Before 2018, the exception is “E_PCI”, which is the same value as “EP_PCI”, representing per capita income, and we’re taking the mean of that. First 15 rows of the aggregated table is shown below:

Code

#select ct:zcta = 1
ct_zcta_r1_18 <- zsvi_pa2018 %>% group_by(GEOID) %>%
  count() %>%
  arrange(n) %>%
  filter(n==1) %>%
  pull(GEOID)

#aggregate by types of variables
var_table18 <- readRDS("../../../data/variable_e_ep_calculation_2018.rds")

var_e18 <- var_table18 %>% 
  filter(theme%in%c(0:4),
    str_detect(.[[1]], "E_")) %>% 
  pull(1)

var_ep18 <- var_table18 %>% 
  filter(theme%in%c(0:4),
    str_detect(.[[1]], "EP_")) %>% 
  pull(1)

cdc18 <- zsvi_pa2018 %>% 
  select(GEOID, ZCTA, all_of(var_e18), all_of(var_ep18)) %>%
  filter(GEOID%in%all_of(ct_zcta_r1_18)) %>% 
  pivot_longer(-c(GEOID,ZCTA), names_to = "var_name", values_to = "value") %>% 
  filter(value >= 0) %>%   # to remove -999 as NA in the table
  group_by(ZCTA, var_name) %>% 
  summarise(sum = sum(value),
    mean = mean(value)) %>%
  mutate(var_zcta = case_when(
    var_name == "E_PCI" ~ mean,
    str_starts(var_name, "E_") ~ sum,
    str_starts(var_name, "EP_") ~ mean
  )) %>%
  ungroup() %>% 
  select(-sum, -mean)

cdc18 %>% head(15) %>% kable()

ZCTA	var_name	var_zcta
15001	EP_AGE17	17.716667
15001	EP_AGE65	20.466667
15001	EP_CROWD	0.850000
15001	EP_DISABL	20.466667
15001	EP_GROUPQ	2.416667
15001	EP_LIMENG	0.600000
15001	EP_MINRTY	32.383333
15001	EP_MOBILE	2.250000
15001	EP_MUNIT	6.100000
15001	EP_NOHSDP	7.816667
15001	EP_NOVEH	14.050000
15001	EP_PCI	24178.666667
15001	EP_POV	23.600000
15001	EP_SNGPNT	10.800000
15001	EP_UNEMP	7.500000

Percentile ranking (“RPL_xx”) by theme

For percentile ranking, for now we’re taking the mean by the ZCTA and we’ll get a table with first few rows looking like this: (I tried weighted by population and it didn’t seem to help too much with correlation.)

Code

cdc18_2 <- zsvi_pa2018 %>% 
  drop_na(ZCTA) %>% 
  filter_all(all_vars(.>=0)) %>% 
  filter(GEOID%in%all_of(ct_zcta_r1_18)) %>% 
  group_by(ZCTA) %>% 
  summarise(
    cdc_RPL_themes = mean(RPL_THEMES), 
    cdc_RPL_theme1 = mean(RPL_THEME1),
    cdc_RPL_theme2 = mean(RPL_THEME2),
    cdc_RPL_theme3 = mean(RPL_THEME3),
    cdc_RPL_theme4 = mean(RPL_THEME4)) %>% 
  mutate(ZCTA = paste(ZCTA)) %>% 
  ungroup()

cdc18_2 %>% head(10) %>% kable()

ZCTA	cdc_RPL_themes	cdc_RPL_theme1	cdc_RPL_theme2	cdc_RPL_theme3	cdc_RPL_theme4
15001	0.6241000	0.6000500	0.6283000	0.5522167	0.5931333
15003	0.7153667	0.6803333	0.6176667	0.5735667	0.6555000
15005	0.3658000	0.3362000	0.5917000	0.1410000	0.5667000
15009	0.2006250	0.2957250	0.5000000	0.1411000	0.2125000
15010	0.7475750	0.7720000	0.6890250	0.5234500	0.6250750
15012	0.4354000	0.5984000	0.2847000	0.4928000	0.2996000
15017	0.4439667	0.4159333	0.5477667	0.2490000	0.5918000
15021	0.5301000	0.7184000	0.7387000	0.2730000	0.2379000
15022	0.7778667	0.6937000	0.9548667	0.2902000	0.7480667
15024	0.1514000	0.2757000	0.3307000	0.0088000	0.3312000

Correlation: aggregated CDC data vs. calculated result

To compare the aggregated CDC SVI data with our calculated SVI, we’re joining the CDC data with our result by each ZCTA, and we could check the correlation between the two versions of SVI in all ZCTAs.

For individual variables, below shows the correlation coefficient for each variable:

Code

result2018 <- readRDS("../../../cdc_us_svi/result/pa_zcta_result2018.rds") 

result18 <- result2018 %>% 
  select(ZCTA = GEOID, all_of(var_e18), all_of(var_ep18)) %>% 
  pivot_longer(-ZCTA, names_to = "var_name", values_to = "value_hx") %>% 
  filter(value_hx >= 0)

cdc_result18 <- cdc18 %>% 
  left_join(result18, by= c("ZCTA", "var_name")) %>% 
  drop_na() %>% 
  group_by(var_name) %>% 
  mutate(cor = cor(var_zcta, value_hx)) %>% 
  ungroup()

p18_e <- cdc_result18 %>% 
  select(var_name, cor) %>% 
  filter(var_name%in%all_of(var_e18)) %>% 
  distinct() %>% 
  ggplot(aes(x=cor, y = reorder(var_name, cor)))+
  geom_col(fill = "#004C54")+
  xlim(0,1)+
  labs(
    y = "variable name",
    x = "corr. coeff.")

p18_ep <- cdc_result18 %>% 
  select(var_name, cor) %>% 
  filter(var_name%in%all_of(var_ep18)) %>% 
  distinct() %>% 
  ggplot(aes(x=cor, y = reorder(var_name, cor)))+
  geom_col(fill = "#49592a")+
  xlim(0,1)+
  labs(y = "variable name",
    x = "corr. coeff.")

p18_e+p18_ep+
  plot_annotation(title = "Correlation coeff. for each variable in SVI calculation",
    subtitle = "between calculated & CDC SVI in 2018 (zcta level)")&
  theme(plot.title = element_text(size = 14))

For RPLs, we’ll compare RPL for each theme and overall RPL for all themes. Here is a plot for the overall RPL for all themes of each ZCTA:

Code

cdc_result18_2 <- cdc18_2 %>% 
  left_join(
     result2018 %>% 
      select(
        ZCTA = GEOID, 
        RPL_themes,
        RPL_theme1,
        RPL_theme2,
        RPL_theme3, 
        RPL_theme4
      )) %>%
  drop_na() %>% 
  mutate(cor_all = cor(cdc_RPL_themes, RPL_themes),
    cor1 = cor(cdc_RPL_theme1, RPL_theme1),
    cor2 = cor(cdc_RPL_theme2, RPL_theme2),
    cor3 = cor(cdc_RPL_theme3, RPL_theme3),
    cor4 = cor(cdc_RPL_theme4, RPL_theme4))

p18_rpl <- cdc_result18_2 %>% 
  select(overall = cor_all, 
    theme1 = cor1, 
    theme2 = cor2, 
    theme3 = cor3,
    theme4 = cor4) %>% 
  distinct() %>% 
  pivot_longer(1:5, names_to = "theme", values_to = "value") %>% 
  ggplot()+
  geom_col(aes(x=theme, y = value), fill= "#004C54")+
  labs(y = "Corr. coeff.")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


p18_rpls <- cdc_result18_2 %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_abline(slope = 1, intercept = 0)+
  labs(y = "calculated overall RPL",
    x = "CDC overall RPL")

p18_rpl+p18_rpls+
  plot_annotation(title = "Correlation coeff. for percentile rankings (RPLs)",
    subtitle = "between calculated & CDC SVI in 2018 (zcta level)")&
  theme(plot.title = element_text(size = 14))

Potential over-representation after aggregation

In the plot above, we notice a data point that is close to 1 in calculated result, but less than 0.25 in CDC data. That point corresponds to ZCTA 15207. In CDC data, this ZCTA has the following census tracts:

Code

pa15207 <- zsvi_pa2018 %>% filter(ZCTA == "15207") %>% 
  select(GEOID, ZCTA, RPL_THEMES, RPL_THEME1, RPL_THEME2,
    RPL_THEME3, RPL_THEME4) 

pa15207 %>%  kable()

GEOID	ZCTA	RPL_THEMES	RPL_THEME1	RPL_THEME2	RPL_THEME3	RPL_THEME4
42003151600	15207	0.1821	0.3716	0.0388	0.5829	0.2389
42003151700	15207	0.1931	0.3452	0.0986	0.7330	0.0942
42003310200	15207	0.3420	0.6360	0.5629	0.1507	0.1653
42003310300	15207	0.1966	0.2538	0.2947	0.5303	0.1544
42003488500	15207	0.4132	0.4261	0.4568	0.1379	0.6456
42003562300	15207	0.9781	0.8806	0.9496	0.7079	0.9900
42003562900	15207	0.8141	0.7159	0.9327	0.6773	0.5611
42003980500	15207	-999.0000	-999.0000	0.4865	0.0000	0.0000

But among these census tracts, almost all of them match to other ZCTAs, and the ones end up included in the aggregation are the following two (and the -999 gets further excluded):

Code

pa15207 %>% 
  filter(GEOID%in%ct_zcta_r1_18) %>% kable()

GEOID	ZCTA	RPL_THEMES	RPL_THEME1	RPL_THEME2	RPL_THEME3	RPL_THEME4
42003151600	15207	0.1821	0.3716	0.0388	0.5829	0.2389
42003980500	15207	-999.0000	-999.0000	0.4865	0.0000	0.0000

That’s why the “aggregated” SVI for ZCTA15207 in fact only represents one of the census tracts in the area, leading to a discrepancy with our calculated result.

PA ZCTA-level SVI 2020

Similarly, we could aggregate and compare CDC-released SVI for 2020 at the census tract level with our calculated result at the ZCTA level.

First, we’re joining the CDC SVI table to a new ZCTA to census tract crosswalk updated in 2020. Here is a glance at the table:

Code

ct_zcta_xwalk2020 <- readRDS("../../../data/ct_zcta_xwalk2020.rds")

svi_pa_2020 <- read_csv("../../../cdc_us_svi/cdc_svi_2020_pa_ct.csv") %>% 
  rename(GEOID = FIPS)

zsvi_pa2020 <- svi_pa_2020 %>% 
  left_join(ct_zcta_xwalk2020, by = "GEOID") %>% 
  relocate(ZCTA, .after = GEOID)

zsvi_pa2020 %>% head() %>% select(1:7) %>%  kable()

ST	STATE	ST_ABBR	STCNTY	COUNTY	GEOID	ZCTA
42	Pennsylvania	PA	42001	Adams	42001030101	17019
42	Pennsylvania	PA	42001	Adams	42001030101	17316
42	Pennsylvania	PA	42001	Adams	42001030101	17324
42	Pennsylvania	PA	42001	Adams	42001030101	17372
42	Pennsylvania	PA	42001	Adams	42001030103	17316
42	Pennsylvania	PA	42001	Adams	42001030103	17372

And a brief check for unmatched GEOIDs for reference:

Code

zsvi_pa2020 %>% filter(is.na(ZCTA)) %>% select(1:7) %>% kable()

ST	STATE	ST_ABBR	STCNTY	COUNTY	GEOID	ZCTA
42	Pennsylvania	PA	42055	Franklin	42055010200	NA
42	Pennsylvania	PA	42055	Franklin	42055011302	NA

Secondly, we’ll aggregate the CDC data from ct level to ZCTA level, including the variables and percentile rankings by theme (first 15 rows of each table is shown below).

Aggregated data for all variables:

Code

#select ct:zcta = 1
ct_zcta_r1_20 <- zsvi_pa2020 %>% group_by(GEOID) %>%
  count() %>%
  arrange(n) %>%
  filter(n==1) %>%
  pull(GEOID)

#aggregate by types of variables
var_table20 <- readRDS("../../../data/variable_e_ep_calculation_2020.rds")

var_e20 <- var_table20 %>% 
  filter(theme%in%c(0:4),
    str_detect(.[[1]], "E_")) %>% 
  pull(1)

var_ep20 <- var_table20 %>% 
  filter(theme%in%c(0:4),
    str_detect(.[[1]], "EP_")) %>% 
  pull(1)

cdc20 <- zsvi_pa2020 %>% 
  select(GEOID, ZCTA, all_of(var_e20), all_of(var_ep20)) %>%
  filter(GEOID%in%all_of(ct_zcta_r1_20)) %>% 
  pivot_longer(-c(GEOID,ZCTA), names_to = "var_name", values_to = "value") %>% 
  filter(value >= 0) %>%   # to remove -999 as NA in the table
  group_by(ZCTA, var_name) %>% 
  summarise(sum = sum(value),
    mean = mean(value)) %>%
  mutate(var_zcta = case_when(
    var_name == "E_PCI" ~ mean,
    str_starts(var_name, "E_") ~ sum,
    str_starts(var_name, "EP_") ~ mean
  )) %>%
  ungroup() %>% 
  select(-sum, -mean)

cdc20 %>% head(15) %>%   kable()

ZCTA	var_name	var_zcta
15001	EP_AGE17	15.80
15001	EP_AGE65	23.18
15001	EP_CROWD	0.62
15001	EP_DISABL	22.66
15001	EP_GROUPQ	0.94
15001	EP_HBURD	26.32
15001	EP_LIMENG	0.80
15001	EP_MINRTY	35.50
15001	EP_MOBILE	1.00
15001	EP_MUNIT	7.62
15001	EP_NOHSDP	10.52
15001	EP_NOVEH	16.02
15001	EP_POV150	28.64
15001	EP_SNGPNT	6.92
15001	EP_UNEMP	6.18

Aggregated data for percentile rankings:

Code

cdc20_2 <- zsvi_pa2020 %>% 
  drop_na(ZCTA) %>% 
  filter_all(all_vars(.>=0)) %>% 
  filter(GEOID%in%all_of(ct_zcta_r1_20)) %>% 
  group_by(ZCTA) %>% 
  summarise(
    cdc_RPL_themes = mean(RPL_THEMES), 
    cdc_RPL_theme1 = mean(RPL_THEME1),
    cdc_RPL_theme2 = mean(RPL_THEME2),
    cdc_RPL_theme3 = mean(RPL_THEME3),
    cdc_RPL_theme4 = mean(RPL_THEME4)) %>% 
  mutate(ZCTA = paste(ZCTA)) %>% 
  ungroup()



cdc20_2 %>% head(15) %>%   kable()

ZCTA	cdc_RPL_themes	cdc_RPL_theme1	cdc_RPL_theme2	cdc_RPL_theme3	cdc_RPL_theme4
15001	0.6575200	0.6267600	0.604360	0.7102400	0.6102400
15003	0.5986333	0.6246333	0.491500	0.6259333	0.4723333
15005	0.3201000	0.2105000	0.462100	0.2617000	0.5383000
15009	0.2765000	0.3638750	0.446600	0.2078750	0.2478500
15010	0.6468800	0.5599400	0.600580	0.5336600	0.6685000
15012	0.2837000	0.2684500	0.405100	0.3665000	0.3099000
15017	0.5698000	0.4555667	0.485000	0.4363000	0.7247333
15021	0.5318000	0.6983000	0.549600	0.2400000	0.3299000
15022	0.7126667	0.6582333	0.610200	0.3929333	0.7539667
15024	0.0283000	0.1303000	0.084900	0.1210000	0.0753000
15025	0.6588000	0.6880500	0.538575	0.7208250	0.5044750
15027	0.3743000	0.5193000	0.166900	0.0909000	0.5756000
15033	0.8522000	0.7750500	0.877350	0.7497000	0.7119000
15034	0.6284000	0.7865000	0.502400	0.2556000	0.4559000
15035	0.5636000	0.7136000	0.764700	0.4209000	0.1833000

After adding ZCTA information and aggregating the CDC data, we’re ready to compare them to our calculated result for 2020. Most of the variables turn out quite consistent between our calculation and aggregated CDC data, despite a few peculiar outliers (shown in the correlation plot), possibly due to over-representation of some census tracts.

Code

result2020 <- readRDS("../../../cdc_us_svi/result/pa_zcta_result2020.rds")

result20 <- result2020 %>% 
  select(ZCTA = GEOID, all_of(var_e20), all_of(var_ep20)) %>% 
  pivot_longer(-ZCTA, names_to = "var_name", values_to = "value_hx") %>% 
  filter(value_hx >= 0)

cdc_result20 <- cdc20 %>% 
  left_join(result20, by= c("ZCTA", "var_name")) %>% 
  drop_na() %>% 
  group_by(var_name) %>% 
  mutate(cor = cor(var_zcta, value_hx)) %>% 
  ungroup()

p20_e <- cdc_result20 %>% 
  select(var_name, cor) %>% 
  filter(var_name%in%all_of(var_e20)) %>% 
  distinct() %>% 
  ggplot(aes(x=cor, y = reorder(var_name, cor)))+
  geom_col(fill = "#004C54")+
  xlim(0,1)+
  labs(
    y = "variable name",
    x = "corr. coeff.")

p20_ep <- cdc_result20 %>% 
  select(var_name, cor) %>% 
  filter(var_name%in%all_of(var_ep20)) %>% 
  distinct() %>% 
  ggplot(aes(x=cor, y = reorder(var_name, cor)))+
  geom_col(fill = "#49592a")+
  xlim(0,1)+
  labs(
    y = "variable name",
    x = "corr. coeff.")

p20_e+p20_ep+
  plot_annotation(title = "Correlation coeff. for each variable in SVI calculation",
    subtitle = "between calculated & CDC SVI in 2020 (zcta level)")&
  theme(plot.title = element_text(size = 15))

Code

cdc_result20_2 <- cdc20_2 %>% 
  left_join(
     result2020 %>% 
      select(
        ZCTA = GEOID, 
        RPL_themes,
        RPL_theme1,
        RPL_theme2,
        RPL_theme3, 
        RPL_theme4
      )) %>%
  drop_na() %>% 
  mutate(cor_all = cor(cdc_RPL_themes, RPL_themes),
    cor1 = cor(cdc_RPL_theme1, RPL_theme1),
    cor2 = cor(cdc_RPL_theme2, RPL_theme2),
    cor3 = cor(cdc_RPL_theme3, RPL_theme3),
    cor4 = cor(cdc_RPL_theme4, RPL_theme4))

p20_rpl <- cdc_result20_2 %>% 
  select(overall = cor_all, 
    theme1 = cor1, 
    theme2 = cor2, 
    theme3 = cor3,
    theme4 = cor4) %>% 
  distinct() %>% 
  pivot_longer(1:5, names_to = "theme", values_to = "value") %>% 
  ggplot()+
  geom_col(aes(x=theme, y = value), fill= "#004C54")+
  labs(y = "Corr. coeff.")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))



p20_rpls <- cdc_result20_2 %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_abline(slope = 1, intercept = 0)+
  labs(y = "calculated overall RPL",
    x = "CDC overall RPL")

p20_rpl+p20_rpls+
  plot_annotation(title = "Correlation coeff. for percentile rankings (RPLs)",
    subtitle = "between calculated & CDC SVI in 2020 (zcta level)")&
  theme(plot.title = element_text(size = 15))