Reproducing CDC SVI by Matching Decimal Places

with CT-level PA data in 2018

Author

Heli Xu

Published

February 24, 2023

As mentioned in the previous post, we noticed minor differences between our calculated result and the CDC-released SVI and attributed them to different rounding strategies. While CDC keeps one decimal place for EP_variables (using SQL), our calculation did not round at that stage and therefore kept more digits after the decimal point. Here, to replicate CDC’s approach, we’ll add rounding for the EP_variables to the calculation and see how well it reproduces CDC’s result.
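As a rough illustration of the extra rounding step, the sketch below recomputes four EP_ values for tract 42003982200 from the counts listed in the comparison table later in this post. It assumes E_TOTPOP is the denominator for these four variables, which is our reading of CDC's 2018 documentation; the variable names `counts`, `ep_raw` and `ep_rounded` are made up for this example and are not part of the package.

```r
# Counts for tract 42003982200, copied from the comparison table in this post.
e_totpop <- 4619
counts <- c(AGE65 = 8, AGE17 = 97, MINRTY = 843, GROUPQ = 4592)

# ASSUMPTION: E_TOTPOP is the denominator for these four variables.
ep_raw <- counts / e_totpop * 100

# The extra step: keep one decimal place, as CDC does for EP_variables.
ep_rounded <- round(ep_raw, 1)

print(data.frame(ep_raw, ep_rounded))
```

The rounded values can be checked against the EP_AGE65, EP_AGE17, EP_MINRTY and EP_GROUPQ rows of that table.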

Reproducing CDC SVI

We’ll take census tract (CT)-level data for Pennsylvania (PA) in 2018 as an example and compare our updated result to CDC SVI.

Code
library(tidyverse)
library(knitr)

result_ct_pa2018 <- readRDS("../../../cdc_us_svi/result/pa_tract_result2018_decimal.rds")

svi_pa_2018 <- read_csv("../../../cdc_us_svi/cdc_svi_2018_pa_ct.csv") %>% 
  rename(GEOID = FIPS)

join_table <- function(cdc, diy){
  cdc %>% 
    select(
      GEOID,
      cdc_RPL_themes = RPL_THEMES,
      cdc_RPL_theme1 = RPL_THEME1,
      cdc_RPL_theme2 = RPL_THEME2,
      cdc_RPL_theme3 = RPL_THEME3,
      cdc_RPL_theme4 = RPL_THEME4
    ) %>%
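    # make GEOID a character key so it can be matched against `diy`
    mutate(GEOID = as.character(GEOID)) %>%
    left_join(
      diy %>%
        select(
          GEOID,
          RPL_themes,
          RPL_theme1,
          RPL_theme2,
          RPL_theme3,
          RPL_theme4
        ),
      by = "GEOID"  # join explicitly on the tract identifier
    ) 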
}

ct_check18 <- join_table(svi_pa_2018, result_ct_pa2018)

ct_check18 %>% 
  drop_na() %>% 
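  # keep rows where every numeric column is non-negative (CDC codes missing values as -999)
  filter(if_all(where(is.numeric), ~ .x >= 0)) %>% 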
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_abline(slope = 1, intercept = 0)+
  labs(title = "CDC vs. calculated CT-level SVI for PA in 2018",
    subtitle = "Comparison of overall percentile ranking (RPLs)",
    y = "calculated overall RPL",
    x = "CDC overall RPL")+
  theme(plot.title = element_text(size= 15))

The good news is that the “wandering” data points from the previous post now stay much closer to the line, with a correlation coefficient of 0.9999995.
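The post does not show how the correlation coefficient was computed. One way to get such a number, assuming a plain Pearson correlation between the two RPL columns, is sketched below on a handful of rows copied from the table of largest differences in this post. With the full data, the same call would be applied to `cdc_RPL_themes` and `RPL_themes`; the vector names here are placeholders.

```r
# Five rows copied from the table of largest differences in this post (overall RPLs).
cdc_rpl <- c(0.4229, 0.3404, 0.1436, 0.1743, 0.2912)
diy_rpl <- c(0.4253, 0.3427, 0.1457, 0.1764, 0.2933)

# Pearson correlation; use = "complete.obs" drops pairs containing NA.
r <- cor(cdc_rpl, diy_rpl, use = "complete.obs")

# Largest absolute gap between the two versions in this small sample.
max_abs_diff <- max(abs(cdc_rpl - diy_rpl))

print(r)
print(max_abs_diff)
```

Because these five rows are the worst cases, the coefficient from this sample is not expected to match the one reported for all tracts.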

Looking at the numeric differences between the two versions of SVI:

Code
ct_diff18 <- ct_check18 %>%
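  # keep rows where every numeric column is non-negative (CDC codes missing values as -999)
  filter(if_all(where(is.numeric), ~ .x >= 0)) %>% 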
  select(GEOID, cdc_RPL_themes, RPL_themes) %>% 
  mutate(diff_all = abs(cdc_RPL_themes- RPL_themes)) %>% 
  arrange(desc(diff_all))

ct_diff18 %>% head(15) %>% kable()

GEOID cdc_RPL_themes RPL_themes diff_all
42003982200 0.4229 0.4253 0.0024
42003499300 0.3404 0.3427 0.0023
42041011302 0.1436 0.1457 0.0021
42045407501 0.1743 0.1764 0.0021
42079210300 0.2912 0.2933 0.0021
42011013000 0.1975 0.1995 0.0020
42029308101 0.3345 0.3365 0.0020
42073011600 0.3473 0.3493 0.0020
42003447000 0.0981 0.1001 0.0020
42003070500 0.1404 0.1423 0.0019
42003469000 0.1313 0.1332 0.0019
42017102401 0.2545 0.2564 0.0019
42045407201 0.2292 0.2311 0.0019
42077006600 0.1288 0.1307 0.0019
42077009200 0.3702 0.3721 0.0019

Let’s look further into the tract (GEOID 42003982200) with the largest difference in overall RPL between our calculation and CDC’s:

Code
options(scipen = 9999)

diy18 <- result_ct_pa2018 %>% filter(GEOID == "42003982200") %>% 
  select(-NAME,
    SPL_THEMES = SPL_themes,
    SPL_THEME1 = SPL_theme1,
    SPL_THEME2 = SPL_theme2,
    SPL_THEME3 = SPL_theme3,
    SPL_THEME4 = SPL_theme4,
    RPL_THEMES = RPL_themes,
    RPL_THEME1 = RPL_theme1,
    RPL_THEME2 = RPL_theme2,
    RPL_THEME3 = RPL_theme3,
    RPL_THEME4 = RPL_theme4) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "hSVI") %>% 
  mutate(hSVI = round(hSVI, 4))


cdc18 <- svi_pa_2018 %>% filter(GEOID == "42003982200") %>% 
  select(-(1:5), -LOCATION, -AREA_SQMI) %>% 
  mutate(GEOID = paste(GEOID)) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "cSVI")

diff18 <- diy18 %>% 
  select(-GEOID) %>% 
  left_join(cdc18, by = "var_name") %>% 
  relocate(GEOID, .before = var_name) %>% 
  mutate(diff = abs(hSVI-cSVI)) %>% 
  arrange(desc(diff))

diff18 %>% kable()

GEOID var_name hSVI cSVI diff
42003982200 RPL_THEMES 0.4253 0.4229 0.0024
42003982200 RPL_THEME2 0.0025 0.0019 0.0006
42003982200 RPL_THEME1 0.7930 0.7926 0.0004
42003982200 RPL_THEME4 0.7636 0.7633 0.0003
42003982200 E_TOTPOP 4619.0000 4619.0000 0.0000
42003982200 E_HU 29.0000 29.0000 0.0000
42003982200 E_HH 11.0000 11.0000 0.0000
42003982200 E_POV 10.0000 10.0000 0.0000
42003982200 E_UNEMP 257.0000 257.0000 0.0000
42003982200 E_PCI 3240.0000 3240.0000 0.0000
42003982200 E_NOHSDP 0.0000 0.0000 0.0000
42003982200 E_AGE65 8.0000 8.0000 0.0000
42003982200 E_AGE17 97.0000 97.0000 0.0000
42003982200 E_DISABL 172.0000 172.0000 0.0000
42003982200 E_SNGPNT 0.0000 0.0000 0.0000
42003982200 E_MINRTY 843.0000 843.0000 0.0000
42003982200 E_LIMENG 5.0000 5.0000 0.0000
42003982200 E_MUNIT 15.0000 15.0000 0.0000
42003982200 E_MOBILE 0.0000 0.0000 0.0000
42003982200 E_CROWD 0.0000 0.0000 0.0000
42003982200 E_NOVEH 3.0000 3.0000 0.0000
42003982200 E_GROUPQ 4592.0000 4592.0000 0.0000
42003982200 EP_POV 37.0000 37.0000 0.0000
42003982200 EP_UNEMP 17.2000 17.2000 0.0000
42003982200 EP_PCI 3240.0000 3240.0000 0.0000
42003982200 EP_NOHSDP 0.0000 0.0000 0.0000
42003982200 EP_AGE65 0.2000 0.2000 0.0000
42003982200 EP_AGE17 2.1000 2.1000 0.0000
42003982200 EP_DISABL 3.7000 3.7000 0.0000
42003982200 EP_SNGPNT 0.0000 0.0000 0.0000
42003982200 EP_MINRTY 18.3000 18.3000 0.0000
42003982200 EP_LIMENG 0.1000 0.1000 0.0000
42003982200 EP_MUNIT 51.7000 51.7000 0.0000
42003982200 EP_MOBILE 0.0000 0.0000 0.0000
42003982200 EP_CROWD 0.0000 0.0000 0.0000
42003982200 EP_NOVEH 27.3000 27.3000 0.0000
42003982200 EP_GROUPQ 99.4000 99.4000 0.0000
42003982200 EPL_POV 0.9296 0.9296 0.0000
42003982200 EPL_UNEMP 0.9627 0.9627 0.0000
42003982200 EPL_PCI 0.9997 0.9997 0.0000
42003982200 EPL_NOHSDP 0.0000 0.0000 0.0000
42003982200 EPL_AGE65 0.0028 0.0028 0.0000
42003982200 EPL_AGE17 0.0097 0.0097 0.0000
42003982200 EPL_DISABL 0.0053 0.0053 0.0000
42003982200 EPL_SNGPNT 0.0000 0.0000 0.0000
42003982200 EPL_MINRTY 0.6123 0.6123 0.0000
42003982200 EPL_LIMENG 0.2583 0.2583 0.0000
42003982200 EPL_MUNIT 0.9837 0.9837 0.0000
42003982200 EPL_MOBILE 0.0000 0.0000 0.0000
42003982200 EPL_CROWD 0.0000 0.0000 0.0000
42003982200 EPL_NOVEH 0.8776 0.8776 0.0000
42003982200 EPL_GROUPQ 0.9972 0.9972 0.0000
42003982200 SPL_THEME1 2.8920 2.8920 0.0000
42003982200 SPL_THEME2 0.0178 0.0178 0.0000
42003982200 SPL_THEME3 0.8706 0.8706 0.0000
42003982200 SPL_THEME4 2.8585 2.8585 0.0000
42003982200 RPL_THEME3 0.4518 0.4518 0.0000
42003982200 SPL_THEMES 6.6389 6.6389 0.0000

For this tract, the differences appear only at the final percentile ranking (RPL) stage: every E_, EP_, EPL_ and SPL_ value is identical. We follow CDC’s calculation description for all percentile rankings, with the same significant digits and ties method. So the most likely culprit is the rounding method, acting through the values of the other tracts that this tract is ranked against.
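To see how a tract's percentile rank can move while its own value stays the same, here is a toy sketch. It assumes the percentile rank is (rank - 1) / (N - 1) with ties sharing the smallest rank and the result rounded to four decimal places; that formula, the helper name `pct_rank`, and the five made-up scores are assumptions for illustration, not the package's actual code.

```r
# ASSUMPTION: percentile rank = (rank - 1) / (N - 1), ties get the smallest rank,
# result rounded to 4 decimal places.
pct_rank <- function(x) {
  round((rank(x, ties.method = "min") - 1) / (length(x) - 1), 4)
}

# Hypothetical summed scores for five tracts; tract "C" is the one we watch.
spl_a <- c(A = 1.0, B = 2.0, C = 3.0, D = 4.0, E = 5.0)

# Same tracts, but tract "B" now sits above "C", e.g. after an upstream rounding difference.
spl_b <- c(A = 1.0, B = 3.5, C = 3.0, D = 4.0, E = 5.0)

rank_a <- pct_rank(spl_a)
rank_b <- pct_rank(spl_b)

# Tract C's own score is unchanged, yet its percentile rank differs between versions.
print(rbind(version_a = rank_a, version_b = rank_b))
```

In the real data the shifts are much smaller, since one tract changing places among several thousand moves a rank by only a tiny fraction.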

Caveat of round()

Our calculation specifies decimal places using round(), which behaves in a tricky way when rounding off a 5. As mentioned in its documentation (paraphrased):

the IEC 60559 standard is expected to be used (“go to the even digit”), but round(0.15, 1) could be either 0.1 or 0.2, depending on the OS services and on representation error.

In our case, round(0.15, 1) returns 0.1, but it appears that CDC’s rounding would return 0.2. For example, if we have an EP_variable value of 18.15 in our original calculation, it would show up as 18.2 in CDC SVI, whereas it would become 18.1 in our calculation after rounding.
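The “representation error” part can be made visible by printing more digits than R shows by default. This is only a small sketch: the variable names are placeholders, and the last line prints whatever round() returns on the machine running it, which may vary with the R version and platform.

```r
# Print 20 decimal places to reveal the value that is actually stored.
stored_015 <- sprintf("%.20f", 0.15)
stored_1815 <- sprintf("%.20f", 18.15)

# Both stored values sit slightly below the number as written,
# so they are not exact ties from round()'s point of view.
print(stored_015)
print(stored_1815)

# What round() returns here can depend on the R version and platform.
print(round(c(0.15, 18.15), 1))
```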

This only matters for values that fall exactly halfway between two one-decimal numbers, such as 18.15, and the EP_ values of the tract shown above happen not to be affected. But other tracts might have values that are rounded down in our calculation and rounded up in CDC’s, which in turn shifts the percentile ranking of a given tract among all tracts.

One option is to add a new rounding function to the package, but for now, this is as close as we can get to reproducing the CDC SVI result, which is not too bad.
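A minimal sketch of what such a rounding function could look like is given below. It assumes that CDC's rounding sends halves away from zero, which this post only infers from the published values. `round_half_up` is a hypothetical helper name, not an existing function in the package, and the small tolerance added before flooring is one possible way to absorb representation error, not a verified match for SQL's behavior.

```r
# Round halves away from zero instead of relying on round()'s behavior.
# ASSUMPTION: this mimics what CDC's SQL rounding appears to do.
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  # Tiny tolerance so values stored just below a tie (e.g. 18.15) still round up.
  tol <- sqrt(.Machine$double.eps)
  sign(x) * floor(abs(x) * scale + 0.5 + tol) / scale
}

examples <- c(0.15, 18.15, 18.14, 2.5, -2.5)
half_up <- c(
  round_half_up(examples[1:3], 1),  # one decimal place, as for EP_variables
  round_half_up(examples[4:5], 0)   # whole numbers, to show the sign handling
)

print(data.frame(value = examples, half_up = half_up))
```

The tolerance also pushes up values that lie within about 1.5e-8 of a tie on the scaled value, which seems an acceptable trade-off for percentages kept to one decimal place.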