SVI calculation validation: why are they not identical?

CT- and CTY-level comparison for PA in 2018 and 2020, respectively

Author

Heli Xu

Published

February 22, 2023

In the SVI calculation validation process, while we see very strong correlations between our calculated result and the CDC-released SVI of the same year at the county and census tract level (detailed in the previous post), we do observe minor differences. Here, we explore why our calculation is not identical to CDC SVI.

For example, here is a scatter plot showing the correlation of the two versions of CT-level SVI for PA in 2018 (the tract with the largest difference between CDC and calculated RPL is highlighted in red):

Code
library(tidyverse)
library(patchwork)
library(knitr)


result_ct_pa2018 <- readRDS("../../../cdc_us_svi/result/pa_tract_result2018.rds")

svi_pa_2018 <- read_csv("../../../cdc_us_svi/cdc_svi_2018_pa_ct.csv") %>% 
  rename(GEOID = FIPS)

#bad_tract = '42071010900'

#make a function for joining CDC SVI and our results
join_table <- function(cdc, diy){
  cdc %>% 
    select(
      GEOID,
      cdc_RPL_themes = RPL_THEMES,
      cdc_RPL_theme1 = RPL_THEME1,
      cdc_RPL_theme2 = RPL_THEME2,
      cdc_RPL_theme3 = RPL_THEME3,
      cdc_RPL_theme4 = RPL_THEME4
    ) %>%
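    # read_csv() may parse FIPS as numeric, so convert it to character before joining
    mutate(GEOID = as.character(GEOID)) %>%
    left_join(
      diy %>%
        select(
          GEOID,
          RPL_themes,
          RPL_theme1,
          RPL_theme2,
          RPL_theme3,
          RPL_theme4
        ),
      by = "GEOID"  # join explicitly on the geographic identifier
    ) 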
}

ct_check18 <-  join_table(svi_pa_2018, result_ct_pa2018)

ct_check18 %>% 
  drop_na() %>% 
  filter_all(all_vars(.>=0)) %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_point(data = ct_check18 %>% filter(GEOID == "42071010900"),
    aes(x = cdc_RPL_themes, y = RPL_themes),
    color = 'red')+
  geom_abline(slope = 1, intercept = 0)+
  labs(title = "CDC vs. calculated CT-level SVI for PA in 2018",
    subtitle = "Comparison of overall percentile ranking (RPLs)",
    y = "calculated overall RPL",
    x = "CDC overall RPL")+
  theme(plot.title = element_text(size= 15))

There are clearly some dots wandering (a bit) away from the line. If we’re retrieving data at the same geographic level and following the same calculation procedure as CDC does, why would there be differences at all?

Minor differences at CT level

Using the same example as the plot above, we’ll specifically look at the difference (absolute value) between the two versions of overall RPLs for all tracts (GEOIDs). Arranging the difference in descending order, we can get a glance at the tracts with relatively high discrepancy between the RPLs from CDC SVI and ours (first 15 rows are shown below).

Code
ct_diff18 <- ct_check18 %>%
  filter_all(all_vars(.>=0)) %>% 
  select(GEOID, cdc_RPL_themes, RPL_themes) %>% 
  mutate(diff_all = abs(cdc_RPL_themes- RPL_themes)) %>% 
  arrange(desc(diff_all))

ct_diff18 %>% head(15) %>% kable()
GEOID cdc_RPL_themes RPL_themes diff_all
42071010900 0.3483 0.4098 0.0615
42011012300 0.5824 0.6330 0.0506
42021011100 0.3564 0.4058 0.0494
42071010701 0.2624 0.3117 0.0493
42003488600 0.3160 0.3623 0.0463
42003523300 0.5320 0.5771 0.0451
42065950600 0.5674 0.6124 0.0450
42007605100 0.4953 0.5402 0.0449
42107002900 0.2461 0.2904 0.0443
42107003000 0.2665 0.3104 0.0439
42009960900 0.5755 0.6186 0.0431
42057960200 0.5125 0.5549 0.0424
42117950400 0.5944 0.6368 0.0424
42089300201 0.2433 0.2854 0.0421
42007605800 0.2426 0.2845 0.0419

Zooming in on the GEOID with the largest difference between the two versions of RPLs, 42071010900, we can extract information on the individual variables from our calculated SVI and CDC SVI. (In the table below, calculated SVI is denoted as “hSVI” and CDC SVI as “cSVI”; the “diff” column contains the absolute difference between hSVI and cSVI, and rows are arranged by it in descending order.)

Code
options(scipen = 9999)

diy18 <- result_ct_pa2018 %>% filter(GEOID == "42071010900") %>% 
  select(-NAME,
    SPL_THEMES = SPL_themes,
    SPL_THEME1 = SPL_theme1,
    SPL_THEME2 = SPL_theme2,
    SPL_THEME3 = SPL_theme3,
    SPL_THEME4 = SPL_theme4,
    RPL_THEMES = RPL_themes,
    RPL_THEME1 = RPL_theme1,
    RPL_THEME2 = RPL_theme2,
    RPL_THEME3 = RPL_theme3,
    RPL_THEME4 = RPL_theme4) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "hSVI") %>% 
  mutate(hSVI = round(hSVI, 4))


cdc18 <- svi_pa_2018 %>% filter(GEOID == "42071010900") %>% 
  select(-(1:5), -LOCATION, -AREA_SQMI) %>% 
  mutate(GEOID = paste(GEOID)) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "cSVI")

diff18 <- diy18 %>% 
  select(-GEOID) %>% 
  left_join(cdc18, by = "var_name") %>% 
  relocate(GEOID, .before = var_name) %>% 
  mutate(diff = abs(hSVI-cSVI)) %>% 
  arrange(desc(diff))

diff18 %>% kable()
GEOID var_name hSVI cSVI diff
42071010900 SPL_THEMES 6.5903 6.2476 0.3427
42071010900 SPL_THEME4 2.5540 2.2239 0.3301
42071010900 EPL_GROUPQ 0.3158 0.0000 0.3158
42071010900 RPL_THEME4 0.6143 0.4693 0.1450
42071010900 RPL_THEMES 0.4098 0.3483 0.0615
42071010900 EP_MUNIT 5.8613 5.9000 0.0387
42071010900 EP_SNGPNT 6.2356 6.2000 0.0356
42071010900 EP_LIMENG 0.5653 0.6000 0.0347
42071010900 EP_AGE17 20.4663 20.5000 0.0337
42071010900 EP_MINRTY 16.7283 16.7000 0.0283
42071010900 EP_GROUPQ 0.0247 0.0000 0.0247
42071010900 EP_CROWD 1.3218 1.3000 0.0218
42071010900 EPL_CROWD 0.6293 0.6160 0.0133
42071010900 SPL_THEME3 1.0831 1.0760 0.0071
42071010900 EPL_LIMENG 0.4959 0.4900 0.0059
42071010900 SPL_THEME2 1.4882 1.4827 0.0055
42071010900 EPL_SNGPNT 0.4236 0.4190 0.0046
42071010900 RPL_THEME3 0.5707 0.5741 0.0034
42071010900 EPL_MINRTY 0.5872 0.5860 0.0012
42071010900 RPL_THEME2 0.1882 0.1871 0.0011
42071010900 EPL_MUNIT 0.5947 0.5938 0.0009
42071010900 EPL_AGE17 0.5003 0.4994 0.0009
42071010900 RPL_THEME1 0.3377 0.3374 0.0003
42071010900 EPL_MOBILE 0.6961 0.6960 0.0001
42071010900 E_TOTPOP 8106.0000 8106.0000 0.0000
42071010900 E_HU 3634.0000 3634.0000 0.0000
42071010900 E_HH 3480.0000 3480.0000 0.0000
42071010900 E_POV 541.0000 541.0000 0.0000
42071010900 E_UNEMP 200.0000 200.0000 0.0000
42071010900 E_PCI 32199.0000 32199.0000 0.0000
42071010900 E_NOHSDP 426.0000 426.0000 0.0000
42071010900 E_AGE65 1132.0000 1132.0000 0.0000
42071010900 E_AGE17 1659.0000 1659.0000 0.0000
42071010900 E_DISABL 933.0000 933.0000 0.0000
42071010900 E_SNGPNT 217.0000 217.0000 0.0000
42071010900 E_MINRTY 1356.0000 1356.0000 0.0000
42071010900 E_LIMENG 43.0000 43.0000 0.0000
42071010900 E_MUNIT 213.0000 213.0000 0.0000
42071010900 E_MOBILE 97.0000 97.0000 0.0000
42071010900 E_CROWD 46.0000 46.0000 0.0000
42071010900 E_NOVEH 150.0000 150.0000 0.0000
42071010900 E_GROUPQ 2.0000 2.0000 0.0000
42071010900 EP_POV 6.8000 6.8000 0.0000
42071010900 EP_UNEMP 4.0000 4.0000 0.0000
42071010900 EP_PCI 32199.0000 32199.0000 0.0000
42071010900 EP_NOHSDP 7.3000 7.3000 0.0000
42071010900 EP_AGE65 14.0000 14.0000 0.0000
42071010900 EP_DISABL 11.5000 11.5000 0.0000
42071010900 EP_MOBILE 2.7000 2.7000 0.0000
42071010900 EP_NOVEH 4.3000 4.3000 0.0000
42071010900 EPL_POV 0.3207 0.3207 0.0000
42071010900 EPL_UNEMP 0.3312 0.3312 0.0000
42071010900 EPL_PCI 0.3994 0.3994 0.0000
42071010900 EPL_NOHSDP 0.4137 0.4137 0.0000
42071010900 EPL_AGE65 0.2699 0.2699 0.0000
42071010900 EPL_DISABL 0.2944 0.2944 0.0000
42071010900 EPL_NOVEH 0.3181 0.3181 0.0000
42071010900 SPL_THEME1 1.4650 1.4650 0.0000

From the side-by-side comparison, the minor discrepancies in the RPLs (and SPLs, EPLs) are most likely due to a different number of decimal places in some EP_ variables. cSVI reports the EP_ variables to one decimal place (when calculation is required, that is, when the percentage cannot be retrieved directly from census data), whereas hSVI does not specify decimal places and therefore carries more digits after the decimal point. For example, EP_GROUPQ is 0.0247 in hSVI but 0.0000 in cSVI, which moves EPL_GROUPQ from 0.3158 to 0.0000 and carries through to SPL_THEME4, RPL_THEME4, and the overall RPL.
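To illustrate how rounding a percentage to one decimal place can shift a percentile rank, here is a minimal sketch on made-up numbers. The tract labels and counts are hypothetical (only tract "B" reuses the 2 out of 8106 from the table above), and the `pct_rank()` helper assumes the percentile rank is computed as (rank - 1) / (N - 1) with ties sharing the lowest rank; it is not taken from our pipeline or from CDC's code.

```r
# Hypothetical toy data: five tracts with group-quarters counts and populations.
# Only tract "B" borrows real numbers from the table above (2 of 8106 people).
tracts   <- c("A", "B", "C", "D", "E")
e_groupq <- c(0, 2, 5, 40, 300)
e_totpop <- c(5000, 8106, 6000, 4000, 3000)

# Assumed percentile-rank formula: (rank - 1) / (N - 1), ties share the lowest rank.
pct_rank <- function(x) (rank(x, ties.method = "min") - 1) / (length(x) - 1)

ep_raw     <- 100 * e_groupq / e_totpop  # unrounded percentage (hSVI-style)
ep_rounded <- round(ep_raw, 1)           # one decimal place (cSVI-style)

epl_raw     <- pct_rank(ep_raw)
epl_rounded <- pct_rank(ep_rounded)

data.frame(tracts, ep_raw = round(ep_raw, 4), ep_rounded, epl_raw, epl_rounded)
```

In this toy example, tract "B" has an unrounded percentage of about 0.0247, which ranks above tract "A" and gives a percentile rank of 0.25. After rounding, "B" ties with "A" at 0.0 and its percentile rank drops to 0, the same kind of jump seen for EPL_GROUPQ above.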

Minor differences at CTY level

Additionally, we can take a closer look at the minor differences at the county level, this time using the SVIs for PA in 2020. The scatter plot below shows the correlation of overall RPLs between the CDC and calculated versions, with the two counties that differ the most (tied) in red.

Code
result2020_co <- readRDS("../../../cdc_us_svi/result/pa_co_result2020.rds")

svi_pa_2020co <- read_csv("../../../download/2020svi_pa_co_cdc.csv") %>% 
  rename(GEOID = FIPS)

co_check20 <- join_table(svi_pa_2020co, result2020_co)

co_check20 %>% 
  drop_na() %>% 
  filter_all(all_vars(.>=0)) %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#191970")+
  geom_point(data = co_check20 %>% filter(GEOID%in%c("42067", "42061")),
    aes(x = cdc_RPL_themes, y = RPL_themes),
    color = 'red')+
  geom_abline(slope = 1, intercept = 0)+
  labs(title = "CDC vs. calculated CTY-level SVI for PA in 2020",
    subtitle = "Comparison of overall percentile ranking (RPLs)",
    y = "calculated overall RPL",
    x = "CDC overall RPL")+
  theme(plot.title = element_text(size= 15))

Also, the top 15 counties that have the largest difference between CDC and our calculated RPLs are included below:

Code
co_diff20 <- co_check20 %>% 
  filter_all(all_vars(.>=0)) %>% 
  select(GEOID, cdc_RPL_themes, RPL_themes) %>% 
  mutate(diff_all = abs(cdc_RPL_themes- RPL_themes)) %>% 
  arrange(desc(diff_all))

co_diff20 %>% head(15) %>% kable()
GEOID cdc_RPL_themes RPL_themes diff_all
42061 0.5000 0.5455 0.0455
42067 0.5455 0.5000 0.0455
42015 0.6667 0.6364 0.0303
42037 0.1515 0.1212 0.0303
42059 0.4091 0.3788 0.0303
42087 0.7121 0.7424 0.0303
42107 0.6364 0.6667 0.0303
42055 0.3939 0.4091 0.0152
42003 0.2879 0.2727 0.0152
42033 0.8939 0.9091 0.0152
42039 0.8030 0.8182 0.0152
42063 0.8182 0.8030 0.0152
42075 0.9091 0.8939 0.0152
42079 0.9394 0.9242 0.0152
42083 0.5909 0.6061 0.0152
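The differences in this table appear to come in fixed steps (0.0152, 0.0303, 0.0455). A short sketch of why that would be expected, assuming the percentile rank is (rank - 1) / (N - 1) over Pennsylvania's 67 counties, so that moving one position in the ranking changes the RPL by 1/66; the variable names here are only for illustration:

```r
# Pennsylvania has 67 counties, so under the assumed formula
# (rank - 1) / (N - 1) one rank position is worth 1 / 66.
n_counties <- 67
step <- 1 / (n_counties - 1)

# Distinct values of diff_all seen in the table above
diffs <- c(0.0455, 0.0303, 0.0152)

# How many rank positions each difference corresponds to
rank_shift <- round(diffs / step)

data.frame(diff_all = diffs,
           rank_shift = rank_shift,
           reconstructed = round(rank_shift * step, 4))
```

Under this assumption, the largest county-level difference corresponds to a shift of three positions in the ranking, and the smallest to a single position.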

Taking county GEOID 42061 as an example, we can compare the values of all SVI variables (same notation as above; rows are arranged by the “diff” column in descending order):

Code
diy20 <- result2020_co %>% filter(GEOID == "42061") %>% 
  select(-NAME,
    SPL_THEMES = SPL_themes,
    SPL_THEME1 = SPL_theme1,
    SPL_THEME2 = SPL_theme2,
    SPL_THEME3 = SPL_theme3,
    SPL_THEME4 = SPL_theme4,
    RPL_THEMES = RPL_themes,
    RPL_THEME1 = RPL_theme1,
    RPL_THEME2 = RPL_theme2,
    RPL_THEME3 = RPL_theme3,
    RPL_THEME4 = RPL_theme4) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "hSVI") %>% 
  mutate(hSVI = round(hSVI, 4))


cdc20 <- svi_pa_2020co %>% filter(GEOID == "42061") %>% 
  select(-(1:5), -LOCATION, -AREA_SQMI) %>% 
  mutate(GEOID = paste(GEOID)) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "cSVI")

diff20 <- diy20 %>% 
  select(-GEOID) %>% 
  left_join(cdc20, by = "var_name") %>% 
  relocate(GEOID, .before = var_name) %>% 
  mutate(diff = abs(hSVI-cSVI)) %>% 
  arrange(desc(diff))

diff20 %>% kable()
GEOID var_name hSVI cSVI diff
42061 SPL_THEMES 8.0607 7.8486 0.2121
42061 SPL_THEME2 2.5456 2.4395 0.1061
42061 EPL_CROWD 0.7121 0.6061 0.1060
42061 SPL_THEME4 2.9393 2.8333 0.1060
42061 EPL_LIMENG 0.5000 0.4091 0.0909
42061 EP_GROUPQ 12.2472 12.2000 0.0472
42061 EP_HBURD 18.1536 18.2000 0.0464
42061 RPL_THEMES 0.5455 0.5000 0.0455
42061 RPL_THEME2 0.5455 0.5152 0.0303
42061 RPL_THEME4 0.7424 0.7121 0.0303
42061 EP_LIMENG 0.5286 0.5000 0.0286
42061 EP_CROWD 1.4244 1.4000 0.0244
42061 EP_AGE17 18.0773 18.1000 0.0227
42061 EP_MUNIT 1.8216 1.8000 0.0216
42061 EP_SNGPNT 5.5844 5.6000 0.0156
42061 EPL_SNGPNT 0.6970 0.6818 0.0152
42061 EP_POV150 20.0030 20.0000 0.0030
42061 EP_MINRTY 9.6999 9.7000 0.0001
42061 E_TOTPOP 45145.0000 45145.0000 0.0000
42061 E_HU 22727.0000 22727.0000 0.0000
42061 E_HH 16779.0000 16779.0000 0.0000
42061 E_POV150 7922.0000 7922.0000 0.0000
42061 E_UNEMP 1066.0000 1066.0000 0.0000
42061 E_HBURD 3046.0000 3046.0000 0.0000
42061 E_NOHSDP 3346.0000 3346.0000 0.0000
42061 E_UNINSUR 1896.0000 1896.0000 0.0000
42061 E_AGE65 9437.0000 9437.0000 0.0000
42061 E_AGE17 8161.0000 8161.0000 0.0000
42061 E_DISABL 6696.0000 6696.0000 0.0000
42061 E_SNGPNT 937.0000 937.0000 0.0000
42061 E_LIMENG 228.0000 228.0000 0.0000
42061 E_MINRTY 4379.0000 4379.0000 0.0000
42061 E_MUNIT 414.0000 414.0000 0.0000
42061 E_MOBILE 2954.0000 2954.0000 0.0000
42061 E_CROWD 239.0000 239.0000 0.0000
42061 E_NOVEH 1006.0000 1006.0000 0.0000
42061 E_GROUPQ 5529.0000 5529.0000 0.0000
42061 EP_UNEMP 5.4000 5.4000 0.0000
42061 EP_NOHSDP 10.2000 10.2000 0.0000
42061 EP_UNINSUR 4.6000 4.6000 0.0000
42061 EP_AGE65 20.9000 20.9000 0.0000
42061 EP_DISABL 16.4000 16.4000 0.0000
42061 EP_MOBILE 13.0000 13.0000 0.0000
42061 EP_NOVEH 6.0000 6.0000 0.0000
42061 EPL_POV150 0.4697 0.4697 0.0000
42061 EPL_UNEMP 0.6212 0.6212 0.0000
42061 EPL_HBURD 0.0303 0.0303 0.0000
42061 EPL_NOHSDP 0.6061 0.6061 0.0000
42061 EPL_UNINSUR 0.2727 0.2727 0.0000
42061 EPL_AGE65 0.5758 0.5758 0.0000
42061 EPL_AGE17 0.1364 0.1364 0.0000
42061 EPL_DISABL 0.6364 0.6364 0.0000
42061 EPL_MINRTY 0.5758 0.5758 0.0000
42061 EPL_MUNIT 0.1212 0.1212 0.0000
42061 EPL_MOBILE 0.8636 0.8636 0.0000
42061 EPL_NOVEH 0.2727 0.2727 0.0000
42061 EPL_GROUPQ 0.9697 0.9697 0.0000
42061 SPL_THEME1 2.0000 2.0000 0.0000
42061 SPL_THEME3 0.5758 0.5758 0.0000
42061 RPL_THEME1 0.2576 0.2576 0.0000
42061 RPL_THEME3 0.5758 0.5758 0.0000

Similarly, the difference in decimal places in the EP_ variables seems to be the major contributor to the discrepancy in downstream percentile rankings, especially for the variables in themes 2 and 4.

“Caveat” in CDC SVI documentation

In fact, CDC SVI documentation (before 2018) also includes a section called “Reproducibility Caveat”, which mentions that “results may differ slightly when replicating SVI using Microsoft Excel or similar software” due to “variation in the number of decimal places”, as CDC uses SQL for their SVI development. In 2020, this section was removed (and a different section, “Caveat for SVI State Databases”, was added), but the calculation strategy in terms of decimal places appears to be unchanged. We could consider adjusting our calculation to match CDC’s strategy in the future.
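One possible shape for such an adjustment is sketched below: round every EP_ column to one decimal place before the percentile ranks are computed. The helper name `round_ep_columns()` and the toy data frame are hypothetical, and whether R's `round()` handles halfway cases the same way as CDC's SQL rounding is an assumption that would need to be checked.

```r
# Hypothetical helper: round all columns whose names start with "EP_"
# (but not "EPL_") to a fixed number of decimal places.
round_ep_columns <- function(df, digits = 1) {
  ep_cols <- grep("^EP_", names(df), value = TRUE)
  df[ep_cols] <- lapply(df[ep_cols], round, digits = digits)
  df
}

# Toy data frame reusing two EP_GROUPQ values from the tables above;
# the EPL_ column is included only to show that it is left untouched.
toy <- data.frame(
  GEOID      = c("42071010900", "42061"),
  EP_GROUPQ  = c(0.0247, 12.2472),
  EPL_GROUPQ = c(0.3158, 0.9697)
)

toy_rounded <- round_ep_columns(toy)
toy_rounded
```

The rounding would have to happen before the EPL_ percentile ranks are calculated, so that the ranks, sums, and overall RPLs are all derived from the rounded percentages.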