SVI Calculation - SVI calculation validation: why are they not identical?

In the SVI calculation validation process, we see very strong correlations between our calculated results and the CDC-released SVI of the same year at the county and census tract levels (detailed in the previous post), but we also observe minor differences. Here, we explore why our calculation is not identical to the CDC SVI.

For example, here is a scatter plot comparing the two versions of CT-level SVI for PA in 2018 (the tract with the largest difference between the CDC and calculated RPL is highlighted in red):

library(tidyverse)
library(patchwork)
library(knitr)


result_ct_pa2018 <- readRDS("../../../cdc_us_svi/result/pa_tract_result2018.rds")

svi_pa_2018 <- read_csv("../../../cdc_us_svi/cdc_svi_2018_pa_ct.csv") %>% 
  rename(GEOID = FIPS)

#bad_tract = '42071010900'

#make a function for joining CDC SVI and our results
join_table <- function(cdc, diy){
  cdc %>% 
    select(
      GEOID,
      cdc_RPL_themes = RPL_THEMES,
      cdc_RPL_theme1 = RPL_THEME1,
      cdc_RPL_theme2 = RPL_THEME2,
      cdc_RPL_theme3 = RPL_THEME3,
      cdc_RPL_theme4 = RPL_THEME4
    ) %>%
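    # read_csv() parses FIPS as a number; convert it to character to match our GEOID
    mutate(GEOID = as.character(GEOID)) %>%
    left_join(
      diy %>%
        select(
          GEOID,
          RPL_themes,
          RPL_theme1,
          RPL_theme2,
          RPL_theme3,
          RPL_theme4
        ),
      by = "GEOID"
    )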
}

ct_check18 <-  join_table(svi_pa_2018, result_ct_pa2018)

ct_check18 %>% 
  drop_na() %>% 
  # keep rows whose RPLs are all non-negative (CDC codes missing values as -999)
  filter(if_all(where(is.numeric), ~ .x >= 0)) %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_point(data = ct_check18 %>% filter(GEOID == "42071010900"),
    aes(x = cdc_RPL_themes, y = RPL_themes),
    color = 'red')+
  geom_abline(slope = 1, intercept = 0)+
  labs(title = "CDC vs. calculated CT-level SVI for PA in 2018",
    subtitle = "Comparison of overall percentile ranking (RPLs)",
    y = "calculated overall RPL",
    x = "CDC overall RPL")+
  theme(plot.title = element_text(size= 15))

There are clearly some dots wandering (a bit) away from the line. If we’re retrieving data at the same geographic level and following the same calculation procedure as CDC does, why would there be differences at all?

Minor differences at CT level

Using the same example as the plot above, we’ll look specifically at the absolute difference between the two versions of the overall RPL for all tracts (GEOIDs). Arranging the differences in descending order shows which tracts have the largest discrepancy between the RPLs from CDC SVI and ours (the first 15 rows are shown below).

ct_diff18 <- ct_check18 %>%
  # keep rows whose RPLs are all non-negative (CDC codes missing values as -999)
  filter(if_all(where(is.numeric), ~ .x >= 0)) %>% 
  select(GEOID, cdc_RPL_themes, RPL_themes) %>% 
  mutate(diff_all = abs(cdc_RPL_themes - RPL_themes)) %>% 
  arrange(desc(diff_all))

ct_diff18 %>% head(15) %>% kable()

GEOID	cdc_RPL_themes	RPL_themes	diff_all
42071010900	0.3483	0.4098	0.0615
42011012300	0.5824	0.6330	0.0506
42021011100	0.3564	0.4058	0.0494
42071010701	0.2624	0.3117	0.0493
42003488600	0.3160	0.3623	0.0463
42003523300	0.5320	0.5771	0.0451
42065950600	0.5674	0.6124	0.0450
42007605100	0.4953	0.5402	0.0449
42107002900	0.2461	0.2904	0.0443
42107003000	0.2665	0.3104	0.0439
42009960900	0.5755	0.6186	0.0431
42057960200	0.5125	0.5549	0.0424
42117950400	0.5944	0.6368	0.0424
42089300201	0.2433	0.2854	0.0421
42007605800	0.2426	0.2845	0.0419

Zooming in on the GEOID with the largest difference between the two versions of RPLs, 42071010900, we can extract the individual variables from our calculated SVI and the CDC SVI. (In the table below, calculated SVI is denoted as “hSVI” and CDC SVI as “cSVI”; the “diff” column contains the absolute difference between hSVI and cSVI, arranged in descending order.)

options(scipen = 9999)

diy18 <- result_ct_pa2018 %>% filter(GEOID == "42071010900") %>% 
  select(-NAME,
    SPL_THEMES = SPL_themes,
    SPL_THEME1 = SPL_theme1,
    SPL_THEME2 = SPL_theme2,
    SPL_THEME3 = SPL_theme3,
    SPL_THEME4 = SPL_theme4,
    RPL_THEMES = RPL_themes,
    RPL_THEME1 = RPL_theme1,
    RPL_THEME2 = RPL_theme2,
    RPL_THEME3 = RPL_theme3,
    RPL_THEME4 = RPL_theme4) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "hSVI") %>% 
  mutate(hSVI = round(hSVI, 4))


cdc18 <- svi_pa_2018 %>% filter(GEOID == "42071010900") %>% 
  select(-(1:5), -LOCATION, -AREA_SQMI) %>% 
  mutate(GEOID = paste(GEOID)) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "cSVI")

diff18 <- diy18 %>% 
  select(-GEOID) %>% 
  left_join(cdc18, by = "var_name") %>% 
  relocate(GEOID, .before = var_name) %>% 
  mutate(diff = abs(hSVI-cSVI)) %>% 
  arrange(desc(diff))

diff18 %>% kable()

GEOID	var_name	hSVI	cSVI	diff
42071010900	SPL_THEMES	6.5903	6.2476	0.3427
42071010900	SPL_THEME4	2.5540	2.2239	0.3301
42071010900	EPL_GROUPQ	0.3158	0.0000	0.3158
42071010900	RPL_THEME4	0.6143	0.4693	0.1450
42071010900	RPL_THEMES	0.4098	0.3483	0.0615
42071010900	EP_MUNIT	5.8613	5.9000	0.0387
42071010900	EP_SNGPNT	6.2356	6.2000	0.0356
42071010900	EP_LIMENG	0.5653	0.6000	0.0347
42071010900	EP_AGE17	20.4663	20.5000	0.0337
42071010900	EP_MINRTY	16.7283	16.7000	0.0283
42071010900	EP_GROUPQ	0.0247	0.0000	0.0247
42071010900	EP_CROWD	1.3218	1.3000	0.0218
42071010900	EPL_CROWD	0.6293	0.6160	0.0133
42071010900	SPL_THEME3	1.0831	1.0760	0.0071
42071010900	EPL_LIMENG	0.4959	0.4900	0.0059
42071010900	SPL_THEME2	1.4882	1.4827	0.0055
42071010900	EPL_SNGPNT	0.4236	0.4190	0.0046
42071010900	RPL_THEME3	0.5707	0.5741	0.0034
42071010900	EPL_MINRTY	0.5872	0.5860	0.0012
42071010900	RPL_THEME2	0.1882	0.1871	0.0011
42071010900	EPL_MUNIT	0.5947	0.5938	0.0009
42071010900	EPL_AGE17	0.5003	0.4994	0.0009
42071010900	RPL_THEME1	0.3377	0.3374	0.0003
42071010900	EPL_MOBILE	0.6961	0.6960	0.0001
42071010900	E_TOTPOP	8106.0000	8106.0000	0.0000
42071010900	E_HU	3634.0000	3634.0000	0.0000
42071010900	E_HH	3480.0000	3480.0000	0.0000
42071010900	E_POV	541.0000	541.0000	0.0000
42071010900	E_UNEMP	200.0000	200.0000	0.0000
42071010900	E_PCI	32199.0000	32199.0000	0.0000
42071010900	E_NOHSDP	426.0000	426.0000	0.0000
42071010900	E_AGE65	1132.0000	1132.0000	0.0000
42071010900	E_AGE17	1659.0000	1659.0000	0.0000
42071010900	E_DISABL	933.0000	933.0000	0.0000
42071010900	E_SNGPNT	217.0000	217.0000	0.0000
42071010900	E_MINRTY	1356.0000	1356.0000	0.0000
42071010900	E_LIMENG	43.0000	43.0000	0.0000
42071010900	E_MUNIT	213.0000	213.0000	0.0000
42071010900	E_MOBILE	97.0000	97.0000	0.0000
42071010900	E_CROWD	46.0000	46.0000	0.0000
42071010900	E_NOVEH	150.0000	150.0000	0.0000
42071010900	E_GROUPQ	2.0000	2.0000	0.0000
42071010900	EP_POV	6.8000	6.8000	0.0000
42071010900	EP_UNEMP	4.0000	4.0000	0.0000
42071010900	EP_PCI	32199.0000	32199.0000	0.0000
42071010900	EP_NOHSDP	7.3000	7.3000	0.0000
42071010900	EP_AGE65	14.0000	14.0000	0.0000
42071010900	EP_DISABL	11.5000	11.5000	0.0000
42071010900	EP_MOBILE	2.7000	2.7000	0.0000
42071010900	EP_NOVEH	4.3000	4.3000	0.0000
42071010900	EPL_POV	0.3207	0.3207	0.0000
42071010900	EPL_UNEMP	0.3312	0.3312	0.0000
42071010900	EPL_PCI	0.3994	0.3994	0.0000
42071010900	EPL_NOHSDP	0.4137	0.4137	0.0000
42071010900	EPL_AGE65	0.2699	0.2699	0.0000
42071010900	EPL_DISABL	0.2944	0.2944	0.0000
42071010900	EPL_NOVEH	0.3181	0.3181	0.0000
42071010900	SPL_THEME1	1.4650	1.4650	0.0000

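From the side-by-side comparison, the minor discrepancy in RPLs (and SPLs, EPLs) is most likely due to a different number of decimal places in some EP_ variables. While cSVI uses one decimal place for the EP_ variables (when calculation is required, that is, when the percentage cannot be retrieved directly from census data), hSVI does not round and therefore keeps more digits after the decimal point. The clearest case here is EP_GROUPQ: hSVI keeps 0.0247, while cSVI shows 0.0, which presumably ties this tract with tracts that have no group-quarters population and drops EPL_GROUPQ from 0.3158 to 0. That single change accounts for most of the difference in SPL_THEME4 and, in turn, in the overall RPL.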
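To see how rounding alone can move a percentile rank, here is a minimal sketch on made-up numbers (not the PA data). It assumes the percentile rank is computed as (rank - 1) / (N - 1) with tied values sharing the lowest rank; `pct_rank` and `ep_groupq` are hypothetical names used only for this illustration.

```r
# Percentile rank as (rank - 1) / (N - 1); ties share the lowest rank.
# The tie rule is an assumption made for this illustration.
pct_rank <- function(x) {
  (rank(x, ties.method = "min") - 1) / (length(x) - 1)
}

# Made-up EP_GROUPQ-like percentages for five tracts (not real data).
ep_groupq <- c(0, 0, 0.0247, 1.36, 5.2)

epl_unrounded <- pct_rank(ep_groupq)            # rank the full-precision values
epl_rounded   <- pct_rank(round(ep_groupq, 1))  # rank values rounded to one decimal

data.frame(ep_groupq, epl_unrounded, epl_rounded)
```

In this toy example the third tract falls from 0.5 to 0 once its value is rounded to 0.0 and tied with the zeros, the same kind of shift seen for EPL_GROUPQ above.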

Minor differences at CTY level

Additionally, we can take a closer look at the minor differences at the county level, this time using the SVIs for PA in 2020. The scatter plot below compares the overall RPLs from CDC with our calculated version, with the two counties that have the largest (tied) difference in red.

result2020_co <- readRDS("../../../cdc_us_svi/result/pa_co_result2020.rds")

svi_pa_2020co <- read_csv("../../../download/2020svi_pa_co_cdc.csv") %>% 
  rename(GEOID = FIPS)

co_check20 <- join_table(svi_pa_2020co, result2020_co)

co_check20 %>% 
  drop_na() %>% 
  # keep rows whose RPLs are all non-negative (CDC codes missing values as -999)
  filter(if_all(where(is.numeric), ~ .x >= 0)) %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#191970")+
  geom_point(data = co_check20 %>% filter(GEOID%in%c("42067", "42061")),
    aes(x = cdc_RPL_themes, y = RPL_themes),
    color = 'red')+
  geom_abline(slope = 1, intercept = 0)+
  labs(title = "CDC vs. calculated CTY-level SVI for PA in 2020",
    subtitle = "Comparison of overall percentile ranking (RPLs)",
    y = "calculated overall RPL",
    x = "CDC overall RPL")+
  theme(plot.title = element_text(size= 15))

The 15 counties with the largest difference between the CDC and our calculated RPLs are shown below:

co_diff20 <- co_check20 %>% 
  # keep rows whose RPLs are all non-negative (CDC codes missing values as -999)
  filter(if_all(where(is.numeric), ~ .x >= 0)) %>% 
  select(GEOID, cdc_RPL_themes, RPL_themes) %>% 
  mutate(diff_all = abs(cdc_RPL_themes - RPL_themes)) %>% 
  arrange(desc(diff_all))

co_diff20 %>% head(15) %>% kable()

GEOID	cdc_RPL_themes	RPL_themes	diff_all
42061	0.5000	0.5455	0.0455
42067	0.5455	0.5000	0.0455
42015	0.6667	0.6364	0.0303
42037	0.1515	0.1212	0.0303
42059	0.4091	0.3788	0.0303
42087	0.7121	0.7424	0.0303
42107	0.6364	0.6667	0.0303
42055	0.3939	0.4091	0.0152
42003	0.2879	0.2727	0.0152
42033	0.8939	0.9091	0.0152
42039	0.8030	0.8182	0.0152
42063	0.8182	0.8030	0.0152
42075	0.9091	0.8939	0.0152
42079	0.9394	0.9242	0.0152
42083	0.5909	0.6061	0.0152

Taking county GEOID 42061 as an example, we can compare the values of all SVI variables (same notation as above; the “diff” column is arranged in descending order):

diy20 <- result2020_co %>% filter(GEOID == "42061") %>% 
  select(-NAME,
    SPL_THEMES = SPL_themes,
    SPL_THEME1 = SPL_theme1,
    SPL_THEME2 = SPL_theme2,
    SPL_THEME3 = SPL_theme3,
    SPL_THEME4 = SPL_theme4,
    RPL_THEMES = RPL_themes,
    RPL_THEME1 = RPL_theme1,
    RPL_THEME2 = RPL_theme2,
    RPL_THEME3 = RPL_theme3,
    RPL_THEME4 = RPL_theme4) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "hSVI") %>% 
  mutate(hSVI = round(hSVI, 4))


cdc20 <- svi_pa_2020co %>% filter(GEOID == "42061") %>% 
  select(-(1:5), -LOCATION, -AREA_SQMI) %>% 
  mutate(GEOID = paste(GEOID)) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "cSVI")

diff20 <- diy20 %>% 
  select(-GEOID) %>% 
  left_join(cdc20, by = "var_name") %>% 
  relocate(GEOID, .before = var_name) %>% 
  mutate(diff = abs(hSVI-cSVI)) %>% 
  arrange(desc(diff))

diff20 %>% kable()

GEOID	var_name	hSVI	cSVI	diff
42061	SPL_THEMES	8.0607	7.8486	0.2121
42061	SPL_THEME2	2.5456	2.4395	0.1061
42061	EPL_CROWD	0.7121	0.6061	0.1060
42061	SPL_THEME4	2.9393	2.8333	0.1060
42061	EPL_LIMENG	0.5000	0.4091	0.0909
42061	EP_GROUPQ	12.2472	12.2000	0.0472
42061	EP_HBURD	18.1536	18.2000	0.0464
42061	RPL_THEMES	0.5455	0.5000	0.0455
42061	RPL_THEME2	0.5455	0.5152	0.0303
42061	RPL_THEME4	0.7424	0.7121	0.0303
42061	EP_LIMENG	0.5286	0.5000	0.0286
42061	EP_CROWD	1.4244	1.4000	0.0244
42061	EP_AGE17	18.0773	18.1000	0.0227
42061	EP_MUNIT	1.8216	1.8000	0.0216
42061	EP_SNGPNT	5.5844	5.6000	0.0156
42061	EPL_SNGPNT	0.6970	0.6818	0.0152
42061	EP_POV150	20.0030	20.0000	0.0030
42061	EP_MINRTY	9.6999	9.7000	0.0001
42061	E_TOTPOP	45145.0000	45145.0000	0.0000
42061	E_HU	22727.0000	22727.0000	0.0000
42061	E_HH	16779.0000	16779.0000	0.0000
42061	E_POV150	7922.0000	7922.0000	0.0000
42061	E_UNEMP	1066.0000	1066.0000	0.0000
42061	E_HBURD	3046.0000	3046.0000	0.0000
42061	E_NOHSDP	3346.0000	3346.0000	0.0000
42061	E_UNINSUR	1896.0000	1896.0000	0.0000
42061	E_AGE65	9437.0000	9437.0000	0.0000
42061	E_AGE17	8161.0000	8161.0000	0.0000
42061	E_DISABL	6696.0000	6696.0000	0.0000
42061	E_SNGPNT	937.0000	937.0000	0.0000
42061	E_LIMENG	228.0000	228.0000	0.0000
42061	E_MINRTY	4379.0000	4379.0000	0.0000
42061	E_MUNIT	414.0000	414.0000	0.0000
42061	E_MOBILE	2954.0000	2954.0000	0.0000
42061	E_CROWD	239.0000	239.0000	0.0000
42061	E_NOVEH	1006.0000	1006.0000	0.0000
42061	E_GROUPQ	5529.0000	5529.0000	0.0000
42061	EP_UNEMP	5.4000	5.4000	0.0000
42061	EP_NOHSDP	10.2000	10.2000	0.0000
42061	EP_UNINSUR	4.6000	4.6000	0.0000
42061	EP_AGE65	20.9000	20.9000	0.0000
42061	EP_DISABL	16.4000	16.4000	0.0000
42061	EP_MOBILE	13.0000	13.0000	0.0000
42061	EP_NOVEH	6.0000	6.0000	0.0000
42061	EPL_POV150	0.4697	0.4697	0.0000
42061	EPL_UNEMP	0.6212	0.6212	0.0000
42061	EPL_HBURD	0.0303	0.0303	0.0000
42061	EPL_NOHSDP	0.6061	0.6061	0.0000
42061	EPL_UNINSUR	0.2727	0.2727	0.0000
42061	EPL_AGE65	0.5758	0.5758	0.0000
42061	EPL_AGE17	0.1364	0.1364	0.0000
42061	EPL_DISABL	0.6364	0.6364	0.0000
42061	EPL_MINRTY	0.5758	0.5758	0.0000
42061	EPL_MUNIT	0.1212	0.1212	0.0000
42061	EPL_MOBILE	0.8636	0.8636	0.0000
42061	EPL_NOVEH	0.2727	0.2727	0.0000
42061	EPL_GROUPQ	0.9697	0.9697	0.0000
42061	SPL_THEME1	2.0000	2.0000	0.0000
42061	SPL_THEME3	0.5758	0.5758	0.0000
42061	RPL_THEME1	0.2576	0.2576	0.0000
42061	RPL_THEME3	0.5758	0.5758	0.0000

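Similarly, the difference in decimal places in the EP_ variables seems to be the major contributor to the discrepancy in downstream percentile rankings, especially for the variables in themes 2 and 4. For this county, the changes in EPL_LIMENG and EPL_SNGPNT add up to the difference in SPL_THEME2, and the change in EPL_CROWD matches the difference in SPL_THEME4.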
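The size of the county-level differences also follows from the ranking itself. Assuming the percentile rank is (rank - 1) / (N - 1) and that all 67 Pennsylvania counties are ranked, one rank position is worth 1/66, so the RPL differences in the county table above should be multiples of that step. A quick check, with variable names chosen only for this sketch:

```r
n_counties <- 67                  # Pennsylvania counties (assumes all are ranked)
step <- 1 / (n_counties - 1)      # change in RPL per rank position

# Overall RPL differences observed in the county table above
observed_diff <- c(0.0455, 0.0303, 0.0152)

# Number of rank positions each difference corresponds to
positions <- round(observed_diff / step)
implied_diff <- round(positions * step, 4)

data.frame(observed_diff, positions, implied_diff)
```

Under these assumptions the largest county-level difference corresponds to a shift of only three rank positions, and the smaller ones to two or one.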

“Caveat” in CDC SVI documentation

In fact, the CDC SVI documentation (before 2018) also includes a section called “Reproducibility Caveat”, which notes that “results may differ slightly when replicating SVI using Microsoft Excel or similar software” due to “variation in the number of decimal places”, as CDC uses SQL to develop the SVI. In the 2020 documentation, this section was removed (and a different section, “Caveat for SVI State Databases”, was added), but the calculation strategy for decimal places appears to be unchanged. We could consider adjusting our calculation to match CDC’s strategy in the future.
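One way such an adjustment might look is to round every EP_ column to one decimal place before the percentile ranks are computed. The helper below is only a sketch under that assumption: `round_ep_columns` is a hypothetical name, and the toy data frame stands in for the real result table. R's `round()` may also treat values that sit exactly halfway between two decimals differently from CDC's SQL rounding, so a few small differences could remain.

```r
# Round all columns whose names start with "EP_" to `digits` decimals.
# Hypothetical helper, not part of the existing calculation.
round_ep_columns <- function(df, digits = 1) {
  ep_cols <- grep("^EP_", names(df), value = TRUE)
  df[ep_cols] <- lapply(df[ep_cols], round, digits = digits)
  df
}

# Toy input using hSVI values from the tract table above.
toy <- data.frame(
  GEOID = "42071010900",
  E_GROUPQ = 2,
  EP_GROUPQ = 0.0247,
  EP_MUNIT = 5.8613,
  EPL_MUNIT = 0.5947   # EPL_ columns do not match "^EP_" and are left untouched
)

rounded <- round_ep_columns(toy)
rounded
```

In the actual pipeline this step would need to run before the EPL_ percentile ranks are derived, so that the rounded values are the ones being ranked.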