SVI Calculation Validation (ZCTA level)

with aggregating CDC CT-level SVI to ZCTA level for PA in 2018 and 2020

Author

Heli Xu

Published

February 8, 2023

Currently, our SVI calculation logic is stored in R functions , and we’re in the process of package development. In responding to the request for PA SVI at the ZCTA level from 2017-2021, we’d like to include this report to provide the rationale and approaches used in validating the results. Any suggestions and feedback are greatly appreciated.

This report will cover the following sections:

SVI calculation and validation

As included in a separate R script (“function collection.R”), get_census_data()(using {tidycensus} under the hood) and get_svi() were used to calculate SVI from census data. The variables required for SVI calculation were either extracted from the dictionary published by CDC when SVI was released (for 2014, 2016, 2018, 2020), or modified from the adjacent years to account for minor changes (for 2015, 2017, 2019, 2021). Details about the variables preparation are included in a separate script (“svi variable prep.R”).

As part of the validation process for our R functions, county and census tract level SVI for certain states are calculated and compared with CDC-released SVI for the same year (if available) or adjacent year, making sure the two versions of SVIs by GEOID are well correlated. For example, our calculated SVI for PA in 2018 at census tract level is highly consistent with the CDC data for 2018, with a correlation coefficient above 0.99 for overall and theme-specific SVI.

Code
library(tidyverse)
library(patchwork)
library(knitr)


result_ct_pa2018 <- readRDS("../../../cdc_us_svi/result/pa_tract_result2018.rds")

svi_pa_2018 <- read_csv("../../../cdc_us_svi/cdc_svi_2018_pa_ct.csv") %>% 
  rename(GEOID = FIPS)

ct_check <- svi_pa_2018 %>% 
  select(
    GEOID, 
    cdc_RPL_themes = RPL_THEMES, 
    cdc_RPL_theme1 = RPL_THEME1,
    cdc_RPL_theme2 = RPL_THEME2,
    cdc_RPL_theme3 = RPL_THEME3,
    cdc_RPL_theme4 = RPL_THEME4
    ) %>%
  mutate(GEOID = paste(GEOID)) %>% 
  left_join(
    result_ct_pa2018 %>% 
      select(
        GEOID, 
        RPL_themes,
        RPL_theme1,
        RPL_theme2,
        RPL_theme3, 
        RPL_theme4
        )
  ) 

# ct_check %>% 
#   filter(is.na(RPL_theme1)) #%>% kable()

ct_corr1 <- ct_check %>% 
  drop_na() %>%   ## remove NA rows
  filter_all(all_vars(.>=0)) %>%
  transmute(overall = cor(cdc_RPL_themes, RPL_themes),
    theme1 = cor(cdc_RPL_theme1, RPL_theme1),
    theme2 = cor(cdc_RPL_theme2, RPL_theme2),
    theme3 = cor(cdc_RPL_theme3, RPL_theme3),
    theme4 = cor(cdc_RPL_theme4, RPL_theme4)) %>% 
  distinct() %>% 
 pivot_longer(1:5, names_to = "theme", values_to = "value") %>% 
  ggplot()+
  geom_col(aes(x=theme, y = value), fill= "#004C54")+
  labs(y = "Corr. coeff.")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ct_corr2 <- ct_check %>% 
  drop_na() %>% 
  filter_all(all_vars(.>=0)) %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_abline(slope = 1, intercept = 0)+
  labs(y = "calculated overall RPL",
    x = "CDC overall RPL")

ct_corr1+ct_corr2+
  plot_annotation(title = "Correlation coeff. for percentile rankings (RPLs)",
    subtitle = "between calculated & CDC SVI in 2018 (ct level)")&
  theme(plot.title = element_text(size = 14))

PA ZCTA-level SVI 2018

With SVI calculated at the ZCTA level in PA from 2017-2021, here we’d like to check the results against the CDC-published SVI data for PA in the years that they are available (2018 and 2020). Since SVI data released by CDC are either at county or census tract level for each state, we need to add ZCTA information to the SVI tables and aggregate the data from census tract level to ZCTA level. ZCTA to census tract (ct) crosswalk were modified from the relationship files (decennial) from census.gov.

Add ZCTA information to CDC-released SVI

Using ZCTA to census tract crosswalk for 2010, we’re matching each census tract in the 2018 CDC SVI table with a corresponding ZCTA (first few tracts shown below):

Code
ct_zcta_xwalk2010 <- readRDS("../../../data/ct_zcta_xwalk2010.rds")



zsvi_pa2018 <- svi_pa_2018 %>% 
  left_join(ct_zcta_xwalk2010, by = "GEOID") %>% 
  relocate(ZCTA, .after = GEOID)

zsvi_pa2018 %>% head() %>%  select(1:7) %>% kable()
ST STATE ST_ABBR STCNTY COUNTY GEOID ZCTA
42 PENNSYLVANIA PA 42001 Adams 42001030101 17019
42 PENNSYLVANIA PA 42001 Adams 42001030101 17316
42 PENNSYLVANIA PA 42001 Adams 42001030101 17324
42 PENNSYLVANIA PA 42001 Adams 42001030101 17372
42 PENNSYLVANIA PA 42001 Adams 42001030102 17316
42 PENNSYLVANIA PA 42001 Adams 42001030102 17350

We could briefly check if there’s any GEOID (ct) that didn’t get matched with a ZCTA code (none):

Code
zsvi_pa2018 %>% filter(is.na(ZCTA)) 
ST STATE ST_ABBR STCNTY COUNTY GEOID ZCTA LOCATION AREA_SQMI E_TOTPOP M_TOTPOP E_HU M_HU E_HH M_HH E_POV M_POV E_UNEMP M_UNEMP E_PCI M_PCI E_NOHSDP M_NOHSDP E_AGE65 M_AGE65 E_AGE17 M_AGE17 E_DISABL M_DISABL E_SNGPNT M_SNGPNT E_MINRTY M_MINRTY E_LIMENG M_LIMENG E_MUNIT M_MUNIT E_MOBILE M_MOBILE E_CROWD M_CROWD E_NOVEH M_NOVEH E_GROUPQ M_GROUPQ EP_POV MP_POV EP_UNEMP MP_UNEMP EP_PCI MP_PCI EP_NOHSDP MP_NOHSDP EP_AGE65 MP_AGE65 EP_AGE17 MP_AGE17 EP_DISABL MP_DISABL EP_SNGPNT MP_SNGPNT EP_MINRTY MP_MINRTY EP_LIMENG MP_LIMENG EP_MUNIT MP_MUNIT EP_MOBILE MP_MOBILE EP_CROWD MP_CROWD EP_NOVEH MP_NOVEH EP_GROUPQ MP_GROUPQ EPL_POV EPL_UNEMP EPL_PCI EPL_NOHSDP SPL_THEME1 RPL_THEME1 EPL_AGE65 EPL_AGE17 EPL_DISABL EPL_SNGPNT SPL_THEME2 RPL_THEME2 EPL_MINRTY EPL_LIMENG SPL_THEME3 RPL_THEME3 EPL_MUNIT EPL_MOBILE EPL_CROWD EPL_NOVEH EPL_GROUPQ SPL_THEME4 RPL_THEME4 SPL_THEMES RPL_THEMES F_POV F_UNEMP F_PCI F_NOHSDP F_THEME1 F_AGE65 F_AGE17 F_DISABL F_SNGPNT F_THEME2 F_MINRTY F_LIMENG F_THEME3 F_MUNIT F_MOBILE F_CROWD F_NOVEH F_GROUPQ F_THEME4 F_TOTAL E_UNINSUR M_UNINSUR EP_UNINSUR MP_UNINSUR E_DAYPOP

Aggregating ct data to ZCTA level

Generally speaking, ZCTAs represent larger areas than census tracts, but census tracts do not necessarily nest within ZCTAs – sometimes one census tract could correspond to multiple ZCTAs (while one or more of those ZCTAs also show up in other census tracts). In such cases, we might be aggregating values of a larger area to represent a smaller area. Considering the complicated relationship between ZCTAs and census tracts, here we’re subsetting the ct-specific GEOIDs with only one matching ZCTA (different ct can be matched to the same ZCTA).

The potential caveat is that we might be over-representing some census tracts for a ZCTA when it’s supposed to include multiple census tracts, but some of the census tracts are excluded because they’re also matched to other ZCTAs. An example will be provided later in this document.

Variables (“E_xx” and “EP_xx”)

For individual variables, we’re taking sum for the “E_xx” values by the ZCTA as they mostly represent counts, whereas for “EP_xx” values, we’re taking the mean as they represent a percentage of the total. Before 2018, the exception is “E_PCI”, which is the same value as “EP_PCI”, representing per capita income, and we’re taking the mean of that. First 15 rows of the aggregated table is shown below:

Code
#select ct:zcta = 1
ct_zcta_r1_18 <- zsvi_pa2018 %>% group_by(GEOID) %>%
  count() %>%
  arrange(n) %>%
  filter(n==1) %>%
  pull(GEOID)

#aggregate by types of variables
var_table18 <- readRDS("../../../data/variable_e_ep_calculation_2018.rds")

var_e18 <- var_table18 %>% 
  filter(theme%in%c(0:4),
    str_detect(.[[1]], "E_")) %>% 
  pull(1)

var_ep18 <- var_table18 %>% 
  filter(theme%in%c(0:4),
    str_detect(.[[1]], "EP_")) %>% 
  pull(1)

cdc18 <- zsvi_pa2018 %>% 
  select(GEOID, ZCTA, all_of(var_e18), all_of(var_ep18)) %>%
  filter(GEOID%in%all_of(ct_zcta_r1_18)) %>% 
  pivot_longer(-c(GEOID,ZCTA), names_to = "var_name", values_to = "value") %>% 
  filter(value >= 0) %>%   # to remove -999 as NA in the table
  group_by(ZCTA, var_name) %>% 
  summarise(sum = sum(value),
    mean = mean(value)) %>%
  mutate(var_zcta = case_when(
    var_name == "E_PCI" ~ mean,
    str_starts(var_name, "E_") ~ sum,
    str_starts(var_name, "EP_") ~ mean
  )) %>%
  ungroup() %>% 
  select(-sum, -mean)

cdc18 %>% head(15) %>% kable()
ZCTA var_name var_zcta
15001 EP_AGE17 17.716667
15001 EP_AGE65 20.466667
15001 EP_CROWD 0.850000
15001 EP_DISABL 20.466667
15001 EP_GROUPQ 2.416667
15001 EP_LIMENG 0.600000
15001 EP_MINRTY 32.383333
15001 EP_MOBILE 2.250000
15001 EP_MUNIT 6.100000
15001 EP_NOHSDP 7.816667
15001 EP_NOVEH 14.050000
15001 EP_PCI 24178.666667
15001 EP_POV 23.600000
15001 EP_SNGPNT 10.800000
15001 EP_UNEMP 7.500000

Percentile ranking (“RPL_xx”) by theme

For percentile ranking, for now we’re taking the mean by the ZCTA and we’ll get a table with first few rows looking like this: (I tried weighted by population and it didn’t seem to help too much with correlation.)

Code
cdc18_2 <- zsvi_pa2018 %>% 
  drop_na(ZCTA) %>% 
  filter_all(all_vars(.>=0)) %>% 
  filter(GEOID%in%all_of(ct_zcta_r1_18)) %>% 
  group_by(ZCTA) %>% 
  summarise(
    cdc_RPL_themes = mean(RPL_THEMES), 
    cdc_RPL_theme1 = mean(RPL_THEME1),
    cdc_RPL_theme2 = mean(RPL_THEME2),
    cdc_RPL_theme3 = mean(RPL_THEME3),
    cdc_RPL_theme4 = mean(RPL_THEME4)) %>% 
  mutate(ZCTA = paste(ZCTA)) %>% 
  ungroup()

cdc18_2 %>% head(10) %>% kable()
ZCTA cdc_RPL_themes cdc_RPL_theme1 cdc_RPL_theme2 cdc_RPL_theme3 cdc_RPL_theme4
15001 0.6241000 0.6000500 0.6283000 0.5522167 0.5931333
15003 0.7153667 0.6803333 0.6176667 0.5735667 0.6555000
15005 0.3658000 0.3362000 0.5917000 0.1410000 0.5667000
15009 0.2006250 0.2957250 0.5000000 0.1411000 0.2125000
15010 0.7475750 0.7720000 0.6890250 0.5234500 0.6250750
15012 0.4354000 0.5984000 0.2847000 0.4928000 0.2996000
15017 0.4439667 0.4159333 0.5477667 0.2490000 0.5918000
15021 0.5301000 0.7184000 0.7387000 0.2730000 0.2379000
15022 0.7778667 0.6937000 0.9548667 0.2902000 0.7480667
15024 0.1514000 0.2757000 0.3307000 0.0088000 0.3312000

Correlation: aggregated CDC data vs. calculated result

To compare the aggregated CDC SVI data with our calculated SVI, we’re joining the CDC data with our result by each ZCTA, and we could check the correlation between the two versions of SVI in all ZCTAs.

For individual variables, below shows the correlation coefficient for each variable:

Code
result2018 <- readRDS("../../../cdc_us_svi/result/pa_zcta_result2018.rds") 

result18 <- result2018 %>% 
  select(ZCTA = GEOID, all_of(var_e18), all_of(var_ep18)) %>% 
  pivot_longer(-ZCTA, names_to = "var_name", values_to = "value_hx") %>% 
  filter(value_hx >= 0)

cdc_result18 <- cdc18 %>% 
  left_join(result18, by= c("ZCTA", "var_name")) %>% 
  drop_na() %>% 
  group_by(var_name) %>% 
  mutate(cor = cor(var_zcta, value_hx)) %>% 
  ungroup()

p18_e <- cdc_result18 %>% 
  select(var_name, cor) %>% 
  filter(var_name%in%all_of(var_e18)) %>% 
  distinct() %>% 
  ggplot(aes(x=cor, y = reorder(var_name, cor)))+
  geom_col(fill = "#004C54")+
  xlim(0,1)+
  labs(
    y = "variable name",
    x = "corr. coeff.")

p18_ep <- cdc_result18 %>% 
  select(var_name, cor) %>% 
  filter(var_name%in%all_of(var_ep18)) %>% 
  distinct() %>% 
  ggplot(aes(x=cor, y = reorder(var_name, cor)))+
  geom_col(fill = "#49592a")+
  xlim(0,1)+
  labs(y = "variable name",
    x = "corr. coeff.")

p18_e+p18_ep+
  plot_annotation(title = "Correlation coeff. for each variable in SVI calculation",
    subtitle = "between calculated & CDC SVI in 2018 (zcta level)")&
  theme(plot.title = element_text(size = 14))

For RPLs, we’ll compare RPL for each theme and overall RPL for all themes. Here is a plot for the overall RPL for all themes of each ZCTA:

Code
cdc_result18_2 <- cdc18_2 %>% 
  left_join(
     result2018 %>% 
      select(
        ZCTA = GEOID, 
        RPL_themes,
        RPL_theme1,
        RPL_theme2,
        RPL_theme3, 
        RPL_theme4
      )) %>%
  drop_na() %>% 
  mutate(cor_all = cor(cdc_RPL_themes, RPL_themes),
    cor1 = cor(cdc_RPL_theme1, RPL_theme1),
    cor2 = cor(cdc_RPL_theme2, RPL_theme2),
    cor3 = cor(cdc_RPL_theme3, RPL_theme3),
    cor4 = cor(cdc_RPL_theme4, RPL_theme4))

p18_rpl <- cdc_result18_2 %>% 
  select(overall = cor_all, 
    theme1 = cor1, 
    theme2 = cor2, 
    theme3 = cor3,
    theme4 = cor4) %>% 
  distinct() %>% 
  pivot_longer(1:5, names_to = "theme", values_to = "value") %>% 
  ggplot()+
  geom_col(aes(x=theme, y = value), fill= "#004C54")+
  labs(y = "Corr. coeff.")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


p18_rpls <- cdc_result18_2 %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_abline(slope = 1, intercept = 0)+
  labs(y = "calculated overall RPL",
    x = "CDC overall RPL")

p18_rpl+p18_rpls+
  plot_annotation(title = "Correlation coeff. for percentile rankings (RPLs)",
    subtitle = "between calculated & CDC SVI in 2018 (zcta level)")&
  theme(plot.title = element_text(size = 14))

Potential over-representation after aggregation

In the plot above, we notice a data point that is close to 1 in calculated result, but less than 0.25 in CDC data. That point corresponds to ZCTA 15207. In CDC data, this ZCTA has the following census tracts:

Code
pa15207 <- zsvi_pa2018 %>% filter(ZCTA == "15207") %>% 
  select(GEOID, ZCTA, RPL_THEMES, RPL_THEME1, RPL_THEME2,
    RPL_THEME3, RPL_THEME4) 

pa15207 %>%  kable()
GEOID ZCTA RPL_THEMES RPL_THEME1 RPL_THEME2 RPL_THEME3 RPL_THEME4
42003151600 15207 0.1821 0.3716 0.0388 0.5829 0.2389
42003151700 15207 0.1931 0.3452 0.0986 0.7330 0.0942
42003310200 15207 0.3420 0.6360 0.5629 0.1507 0.1653
42003310300 15207 0.1966 0.2538 0.2947 0.5303 0.1544
42003488500 15207 0.4132 0.4261 0.4568 0.1379 0.6456
42003562300 15207 0.9781 0.8806 0.9496 0.7079 0.9900
42003562900 15207 0.8141 0.7159 0.9327 0.6773 0.5611
42003980500 15207 -999.0000 -999.0000 0.4865 0.0000 0.0000

But among these census tracts, almost all of them match to other ZCTAs, and the ones end up included in the aggregation are the following two (and the -999 gets further excluded):

Code
pa15207 %>% 
  filter(GEOID%in%ct_zcta_r1_18) %>% kable()
GEOID ZCTA RPL_THEMES RPL_THEME1 RPL_THEME2 RPL_THEME3 RPL_THEME4
42003151600 15207 0.1821 0.3716 0.0388 0.5829 0.2389
42003980500 15207 -999.0000 -999.0000 0.4865 0.0000 0.0000

That’s why the “aggregated” SVI for ZCTA15207 in fact only represents one of the census tracts in the area, leading to a discrepancy with our calculated result.

PA ZCTA-level SVI 2020

Similarly, we could aggregate and compare CDC-released SVI for 2020 at the census tract level with our calculated result at the ZCTA level.

First, we’re joining the CDC SVI table to a new ZCTA to census tract crosswalk updated in 2020. Here is a glance at the table:

Code
ct_zcta_xwalk2020 <- readRDS("../../../data/ct_zcta_xwalk2020.rds")

svi_pa_2020 <- read_csv("../../../cdc_us_svi/cdc_svi_2020_pa_ct.csv") %>% 
  rename(GEOID = FIPS)

zsvi_pa2020 <- svi_pa_2020 %>% 
  left_join(ct_zcta_xwalk2020, by = "GEOID") %>% 
  relocate(ZCTA, .after = GEOID)

zsvi_pa2020 %>% head() %>% select(1:7) %>%  kable()
ST STATE ST_ABBR STCNTY COUNTY GEOID ZCTA
42 Pennsylvania PA 42001 Adams 42001030101 17019
42 Pennsylvania PA 42001 Adams 42001030101 17316
42 Pennsylvania PA 42001 Adams 42001030101 17324
42 Pennsylvania PA 42001 Adams 42001030101 17372
42 Pennsylvania PA 42001 Adams 42001030103 17316
42 Pennsylvania PA 42001 Adams 42001030103 17372

And a brief check for unmatched GEOIDs for reference:

Code
zsvi_pa2020 %>% filter(is.na(ZCTA)) %>% select(1:7) %>% kable()
ST STATE ST_ABBR STCNTY COUNTY GEOID ZCTA
42 Pennsylvania PA 42055 Franklin 42055010200 NA
42 Pennsylvania PA 42055 Franklin 42055011302 NA

Secondly, we’ll aggregate the CDC data from ct level to ZCTA level, including the variables and percentile rankings by theme (first 15 rows of each table is shown below).

Aggregated data for all variables:

Code
#select ct:zcta = 1
ct_zcta_r1_20 <- zsvi_pa2020 %>% group_by(GEOID) %>%
  count() %>%
  arrange(n) %>%
  filter(n==1) %>%
  pull(GEOID)

#aggregate by types of variables
var_table20 <- readRDS("../../../data/variable_e_ep_calculation_2020.rds")

var_e20 <- var_table20 %>% 
  filter(theme%in%c(0:4),
    str_detect(.[[1]], "E_")) %>% 
  pull(1)

var_ep20 <- var_table20 %>% 
  filter(theme%in%c(0:4),
    str_detect(.[[1]], "EP_")) %>% 
  pull(1)

cdc20 <- zsvi_pa2020 %>% 
  select(GEOID, ZCTA, all_of(var_e20), all_of(var_ep20)) %>%
  filter(GEOID%in%all_of(ct_zcta_r1_20)) %>% 
  pivot_longer(-c(GEOID,ZCTA), names_to = "var_name", values_to = "value") %>% 
  filter(value >= 0) %>%   # to remove -999 as NA in the table
  group_by(ZCTA, var_name) %>% 
  summarise(sum = sum(value),
    mean = mean(value)) %>%
  mutate(var_zcta = case_when(
    var_name == "E_PCI" ~ mean,
    str_starts(var_name, "E_") ~ sum,
    str_starts(var_name, "EP_") ~ mean
  )) %>%
  ungroup() %>% 
  select(-sum, -mean)

cdc20 %>% head(15) %>%   kable()
ZCTA var_name var_zcta
15001 EP_AGE17 15.80
15001 EP_AGE65 23.18
15001 EP_CROWD 0.62
15001 EP_DISABL 22.66
15001 EP_GROUPQ 0.94
15001 EP_HBURD 26.32
15001 EP_LIMENG 0.80
15001 EP_MINRTY 35.50
15001 EP_MOBILE 1.00
15001 EP_MUNIT 7.62
15001 EP_NOHSDP 10.52
15001 EP_NOVEH 16.02
15001 EP_POV150 28.64
15001 EP_SNGPNT 6.92
15001 EP_UNEMP 6.18

Aggregated data for percentile rankings:

Code
cdc20_2 <- zsvi_pa2020 %>% 
  drop_na(ZCTA) %>% 
  filter_all(all_vars(.>=0)) %>% 
  filter(GEOID%in%all_of(ct_zcta_r1_20)) %>% 
  group_by(ZCTA) %>% 
  summarise(
    cdc_RPL_themes = mean(RPL_THEMES), 
    cdc_RPL_theme1 = mean(RPL_THEME1),
    cdc_RPL_theme2 = mean(RPL_THEME2),
    cdc_RPL_theme3 = mean(RPL_THEME3),
    cdc_RPL_theme4 = mean(RPL_THEME4)) %>% 
  mutate(ZCTA = paste(ZCTA)) %>% 
  ungroup()



cdc20_2 %>% head(15) %>%   kable()
ZCTA cdc_RPL_themes cdc_RPL_theme1 cdc_RPL_theme2 cdc_RPL_theme3 cdc_RPL_theme4
15001 0.6575200 0.6267600 0.604360 0.7102400 0.6102400
15003 0.5986333 0.6246333 0.491500 0.6259333 0.4723333
15005 0.3201000 0.2105000 0.462100 0.2617000 0.5383000
15009 0.2765000 0.3638750 0.446600 0.2078750 0.2478500
15010 0.6468800 0.5599400 0.600580 0.5336600 0.6685000
15012 0.2837000 0.2684500 0.405100 0.3665000 0.3099000
15017 0.5698000 0.4555667 0.485000 0.4363000 0.7247333
15021 0.5318000 0.6983000 0.549600 0.2400000 0.3299000
15022 0.7126667 0.6582333 0.610200 0.3929333 0.7539667
15024 0.0283000 0.1303000 0.084900 0.1210000 0.0753000
15025 0.6588000 0.6880500 0.538575 0.7208250 0.5044750
15027 0.3743000 0.5193000 0.166900 0.0909000 0.5756000
15033 0.8522000 0.7750500 0.877350 0.7497000 0.7119000
15034 0.6284000 0.7865000 0.502400 0.2556000 0.4559000
15035 0.5636000 0.7136000 0.764700 0.4209000 0.1833000

After adding ZCTA information and aggregating the CDC data, we’re ready to compare them to our calculated result for 2020. Most of the variables turn out quite consistent between our calculation and aggregated CDC data, despite a few peculiar outliers (shown in the correlation plot), possibly due to over-representation of some census tracts.

Code
result2020 <- readRDS("../../../cdc_us_svi/result/pa_zcta_result2020.rds")

result20 <- result2020 %>% 
  select(ZCTA = GEOID, all_of(var_e20), all_of(var_ep20)) %>% 
  pivot_longer(-ZCTA, names_to = "var_name", values_to = "value_hx") %>% 
  filter(value_hx >= 0)

cdc_result20 <- cdc20 %>% 
  left_join(result20, by= c("ZCTA", "var_name")) %>% 
  drop_na() %>% 
  group_by(var_name) %>% 
  mutate(cor = cor(var_zcta, value_hx)) %>% 
  ungroup()

p20_e <- cdc_result20 %>% 
  select(var_name, cor) %>% 
  filter(var_name%in%all_of(var_e20)) %>% 
  distinct() %>% 
  ggplot(aes(x=cor, y = reorder(var_name, cor)))+
  geom_col(fill = "#004C54")+
  xlim(0,1)+
  labs(
    y = "variable name",
    x = "corr. coeff.")

p20_ep <- cdc_result20 %>% 
  select(var_name, cor) %>% 
  filter(var_name%in%all_of(var_ep20)) %>% 
  distinct() %>% 
  ggplot(aes(x=cor, y = reorder(var_name, cor)))+
  geom_col(fill = "#49592a")+
  xlim(0,1)+
  labs(
    y = "variable name",
    x = "corr. coeff.")

p20_e+p20_ep+
  plot_annotation(title = "Correlation coeff. for each variable in SVI calculation",
    subtitle = "between calculated & CDC SVI in 2020 (zcta level)")&
  theme(plot.title = element_text(size = 15))

Code
cdc_result20_2 <- cdc20_2 %>% 
  left_join(
     result2020 %>% 
      select(
        ZCTA = GEOID, 
        RPL_themes,
        RPL_theme1,
        RPL_theme2,
        RPL_theme3, 
        RPL_theme4
      )) %>%
  drop_na() %>% 
  mutate(cor_all = cor(cdc_RPL_themes, RPL_themes),
    cor1 = cor(cdc_RPL_theme1, RPL_theme1),
    cor2 = cor(cdc_RPL_theme2, RPL_theme2),
    cor3 = cor(cdc_RPL_theme3, RPL_theme3),
    cor4 = cor(cdc_RPL_theme4, RPL_theme4))

p20_rpl <- cdc_result20_2 %>% 
  select(overall = cor_all, 
    theme1 = cor1, 
    theme2 = cor2, 
    theme3 = cor3,
    theme4 = cor4) %>% 
  distinct() %>% 
  pivot_longer(1:5, names_to = "theme", values_to = "value") %>% 
  ggplot()+
  geom_col(aes(x=theme, y = value), fill= "#004C54")+
  labs(y = "Corr. coeff.")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))



p20_rpls <- cdc_result20_2 %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_abline(slope = 1, intercept = 0)+
  labs(y = "calculated overall RPL",
    x = "CDC overall RPL")

p20_rpl+p20_rpls+
  plot_annotation(title = "Correlation coeff. for percentile rankings (RPLs)",
    subtitle = "between calculated & CDC SVI in 2020 (zcta level)")&
  theme(plot.title = element_text(size = 15))