SVI Calculation - Reproducing CDC SVI by Matching Decimal Places

As mentioned in the previous post, we noticed minor differences between our calculated result and the CDC-released SVI and attributed them to differences in rounding. CDC keeps one decimal place for EP_variables (using SQL), whereas our calculation did not round at that stage and therefore kept more digits after the decimal point. Here, to replicate CDC’s approach, we add rounding for the EP_variables and see how well the modified calculation reproduces CDC’s result.
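The extra rounding step can be sketched roughly as follows. This is a minimal illustration on a toy table, not the package’s actual code: the data frame `toy_ep` and its values are made up, and only the `EP_` column-name prefix is taken from the SVI variables.

```r
library(dplyr)

# Hypothetical unrounded EP_ values for three made-up tracts
toy_ep <- tibble(
  GEOID    = c("A", "B", "C"),
  EP_POV   = c(37.03704, 12.48211, 5.26316),
  EP_UNEMP = c(17.21854, 4.04412, 9.87654)
)

# Keep one decimal place for every EP_ column before any ranking is done
toy_ep_rounded <- toy_ep %>%
  mutate(across(starts_with("EP_"), ~ round(.x, 1)))

toy_ep_rounded
```

Selecting by prefix with `starts_with("EP_")` leaves the `E_` counts untouched, since those are already whole numbers.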

Reproducing CDC SVI

We’ll take census tract (CT) level data for Pennsylvania (PA) in 2018 as an example and compare our updated result to CDC SVI.

library(tidyverse)
library(knitr)

result_ct_pa2018 <- readRDS("../../../cdc_us_svi/result/pa_tract_result2018_decimal.rds")

svi_pa_2018 <- read_csv("../../../cdc_us_svi/cdc_svi_2018_pa_ct.csv") %>% 
  rename(GEOID = FIPS)

join_table <- function(cdc, diy){
  cdc %>% 
    select(
      GEOID,
      cdc_RPL_themes = RPL_THEMES,
      cdc_RPL_theme1 = RPL_THEME1,
      cdc_RPL_theme2 = RPL_THEME2,
      cdc_RPL_theme3 = RPL_THEME3,
      cdc_RPL_theme4 = RPL_THEME4
    ) %>%
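    # convert GEOID to character so it matches the type in our result
    mutate(GEOID = as.character(GEOID)) %>%
    left_join(
      diy %>%
        select(
          GEOID,
          RPL_themes,
          RPL_theme1,
          RPL_theme2,
          RPL_theme3,
          RPL_theme4
        ),
      by = "GEOID"  # join explicitly on the tract identifier
    ) 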
}

ct_check18 <- join_table(svi_pa_2018, result_ct_pa2018)

ct_check18 %>% 
  drop_na() %>% 
  # CDC codes missing values as -999, so keep only rows with non-negative values
  filter(if_all(where(is.numeric), ~ .x >= 0)) %>% 
  ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +
  geom_point(color = "#004C54")+
  geom_abline(slope = 1, intercept = 0)+
  labs(title = "CDC vs. calculated CT-level SVI for PA in 2018",
    subtitle = "Comparison of overall percentile ranking (RPLs)",
    y = "calculated overall RPL",
    x = "CDC overall RPL")+
  theme(plot.title = element_text(size= 15))

The good news is that the “wandering” data points from the previous post now stay much closer to the line, with a correlation coefficient of 0.9999995.
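The code for the correlation coefficient is not shown in this post. One way to compute it, assuming a Pearson correlation from base R’s `cor()`, is sketched below. To keep the snippet self-contained it uses three (CDC, calculated) pairs copied from the difference table in this post rather than the full `ct_check18` table, so its output will not match the figure quoted above.

```r
# Three (CDC, calculated) pairs copied from the difference table in this post;
# the full comparison would use every valid row of ct_check18 instead.
toy_check <- data.frame(
  cdc_RPL_themes = c(0.4229, 0.3404, 0.1436),
  RPL_themes     = c(0.4253, 0.3427, 0.1457)
)

# Pearson correlation between the two versions of the overall ranking
toy_cor <- cor(toy_check$cdc_RPL_themes, toy_check$RPL_themes)

# Largest absolute difference, as a complementary check
toy_max_diff <- max(abs(toy_check$cdc_RPL_themes - toy_check$RPL_themes))

toy_cor
toy_max_diff
```

A correlation this close to 1 can still hide small systematic offsets, which is why looking at the absolute differences is useful as well.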

Looking at the numeric differences between the two versions of SVI:

ct_diff18 <- ct_check18 %>%
  # CDC codes missing values as -999, so keep only rows with non-negative values
  filter(if_all(where(is.numeric), ~ .x >= 0)) %>% 
  select(GEOID, cdc_RPL_themes, RPL_themes) %>% 
  mutate(diff_all = abs(cdc_RPL_themes- RPL_themes)) %>% 
  arrange(desc(diff_all))

ct_diff18 %>% head(15) %>% kable()

GEOID	cdc_RPL_themes	RPL_themes	diff_all
42003982200	0.4229	0.4253	0.0024
42003499300	0.3404	0.3427	0.0023
42041011302	0.1436	0.1457	0.0021
42045407501	0.1743	0.1764	0.0021
42079210300	0.2912	0.2933	0.0021
42011013000	0.1975	0.1995	0.0020
42029308101	0.3345	0.3365	0.0020
42073011600	0.3473	0.3493	0.0020
42003447000	0.0981	0.1001	0.0020
42003070500	0.1404	0.1423	0.0019
42003469000	0.1313	0.1332	0.0019
42017102401	0.2545	0.2564	0.0019
42045407201	0.2292	0.2311	0.0019
42077006600	0.1288	0.1307	0.0019
42077009200	0.3702	0.3721	0.0019

To look further into the tract (GEOID) with the largest difference in overall RPL between our calculation and CDC’s, we compare every variable for that tract:

options(scipen = 9999)

diy18 <- result_ct_pa2018 %>% filter(GEOID == "42003982200") %>% 
  select(-NAME,
    SPL_THEMES = SPL_themes,
    SPL_THEME1 = SPL_theme1,
    SPL_THEME2 = SPL_theme2,
    SPL_THEME3 = SPL_theme3,
    SPL_THEME4 = SPL_theme4,
    RPL_THEMES = RPL_themes,
    RPL_THEME1 = RPL_theme1,
    RPL_THEME2 = RPL_theme2,
    RPL_THEME3 = RPL_theme3,
    RPL_THEME4 = RPL_theme4) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "hSVI") %>% 
  mutate(hSVI = round(hSVI, 4))


cdc18 <- svi_pa_2018 %>% filter(GEOID == "42003982200") %>% 
  select(-(1:5), -LOCATION, -AREA_SQMI) %>% 
  mutate(GEOID = paste(GEOID)) %>% 
  pivot_longer(-1, names_to = "var_name", values_to = "cSVI")

diff18 <- diy18 %>% 
  select(-GEOID) %>% 
  left_join(cdc18, by = "var_name") %>% 
  relocate(GEOID, .before = var_name) %>% 
  mutate(diff = abs(hSVI-cSVI)) %>% 
  arrange(desc(diff))

diff18 %>% kable()

GEOID	var_name	hSVI	cSVI	diff
42003982200	RPL_THEMES	0.4253	0.4229	0.0024
42003982200	RPL_THEME2	0.0025	0.0019	0.0006
42003982200	RPL_THEME1	0.7930	0.7926	0.0004
42003982200	RPL_THEME4	0.7636	0.7633	0.0003
42003982200	E_TOTPOP	4619.0000	4619.0000	0.0000
42003982200	E_HU	29.0000	29.0000	0.0000
42003982200	E_HH	11.0000	11.0000	0.0000
42003982200	E_POV	10.0000	10.0000	0.0000
42003982200	E_UNEMP	257.0000	257.0000	0.0000
42003982200	E_PCI	3240.0000	3240.0000	0.0000
42003982200	E_NOHSDP	0.0000	0.0000	0.0000
42003982200	E_AGE65	8.0000	8.0000	0.0000
42003982200	E_AGE17	97.0000	97.0000	0.0000
42003982200	E_DISABL	172.0000	172.0000	0.0000
42003982200	E_SNGPNT	0.0000	0.0000	0.0000
42003982200	E_MINRTY	843.0000	843.0000	0.0000
42003982200	E_LIMENG	5.0000	5.0000	0.0000
42003982200	E_MUNIT	15.0000	15.0000	0.0000
42003982200	E_MOBILE	0.0000	0.0000	0.0000
42003982200	E_CROWD	0.0000	0.0000	0.0000
42003982200	E_NOVEH	3.0000	3.0000	0.0000
42003982200	E_GROUPQ	4592.0000	4592.0000	0.0000
42003982200	EP_POV	37.0000	37.0000	0.0000
42003982200	EP_UNEMP	17.2000	17.2000	0.0000
42003982200	EP_PCI	3240.0000	3240.0000	0.0000
42003982200	EP_NOHSDP	0.0000	0.0000	0.0000
42003982200	EP_AGE65	0.2000	0.2000	0.0000
42003982200	EP_AGE17	2.1000	2.1000	0.0000
42003982200	EP_DISABL	3.7000	3.7000	0.0000
42003982200	EP_SNGPNT	0.0000	0.0000	0.0000
42003982200	EP_MINRTY	18.3000	18.3000	0.0000
42003982200	EP_LIMENG	0.1000	0.1000	0.0000
42003982200	EP_MUNIT	51.7000	51.7000	0.0000
42003982200	EP_MOBILE	0.0000	0.0000	0.0000
42003982200	EP_CROWD	0.0000	0.0000	0.0000
42003982200	EP_NOVEH	27.3000	27.3000	0.0000
42003982200	EP_GROUPQ	99.4000	99.4000	0.0000
42003982200	EPL_POV	0.9296	0.9296	0.0000
42003982200	EPL_UNEMP	0.9627	0.9627	0.0000
42003982200	EPL_PCI	0.9997	0.9997	0.0000
42003982200	EPL_NOHSDP	0.0000	0.0000	0.0000
42003982200	EPL_AGE65	0.0028	0.0028	0.0000
42003982200	EPL_AGE17	0.0097	0.0097	0.0000
42003982200	EPL_DISABL	0.0053	0.0053	0.0000
42003982200	EPL_SNGPNT	0.0000	0.0000	0.0000
42003982200	EPL_MINRTY	0.6123	0.6123	0.0000
42003982200	EPL_LIMENG	0.2583	0.2583	0.0000
42003982200	EPL_MUNIT	0.9837	0.9837	0.0000
42003982200	EPL_MOBILE	0.0000	0.0000	0.0000
42003982200	EPL_CROWD	0.0000	0.0000	0.0000
42003982200	EPL_NOVEH	0.8776	0.8776	0.0000
42003982200	EPL_GROUPQ	0.9972	0.9972	0.0000
42003982200	SPL_THEME1	2.8920	2.8920	0.0000
42003982200	SPL_THEME2	0.0178	0.0178	0.0000
42003982200	SPL_THEME3	0.8706	0.8706	0.0000
42003982200	SPL_THEME4	2.8585	2.8585	0.0000
42003982200	RPL_THEME3	0.4518	0.4518	0.0000
42003982200	SPL_THEMES	6.6389	6.6389	0.0000

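For this tract, the differences appear only at the final percentile ranking stage: the E_, EP_, EPL_ and SPL_ values are all identical, and only RPL_THEMES, RPL_THEME1, RPL_THEME2 and RPL_THEME4 differ. Because an RPL is the tract’s rank among all tracts, it can shift when the SPL values of other tracts differ, even if its own inputs match. We follow CDC’s calculation description for all percentile rankings, with the same significant digits and ties method, so the most likely culprit is the rounding method.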
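A toy example may help show how a tract’s RPL can move while its own SPL stays the same. The sketch below assumes the percentile rank is computed as (rank - 1) / (N - 1) with ties given the minimum rank and the result rounded to four decimals; the helper `pct_rank` and all SPL values other than 6.6389 are made up for illustration.

```r
# Percentile rank as (rank - 1) / (N - 1); ties.method = "min" is an assumption
pct_rank <- function(x) {
  round((rank(x, ties.method = "min") - 1) / (length(x) - 1), 4)
}

# Made-up SPL sums for four tracts. Tract A is identical in both versions;
# tract B differs slightly, e.g. because one of its EP_ values rounded differently.
spl_cdc  <- c(A = 6.6389, B = 6.6400, C = 5.0000, D = 8.0000)
spl_calc <- c(A = 6.6389, B = 6.6300, C = 5.0000, D = 8.0000)

pct_rank(spl_cdc)["A"]   # A is 2nd lowest of 4
pct_rank(spl_calc)["A"]  # A is 3rd lowest of 4, although its own SPL is unchanged
```

With thousands of tracts instead of four, one tract crossing another changes the percentile rank by a much smaller amount, which is consistent with the small differences seen in the table above.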

Caveat of `round()`

Our calculation sets the number of decimal places using round(), which behaves in a tricky way when rounding off a 5. As noted in its documentation (paraphrased):

The IEC 60559 standard is expected to be used (“go to the even digit”), but round(0.15, 1) could be either 0.1 or 0.2, depending on the OS services and on representation error.

In our case, round(0.15, 1) returns 0.1, but it appears that CDC’s rounding would return 0.2. For example, an EP_variable value of 18.15 in our original calculation would show up as 18.2 in CDC SVI, whereas it becomes 18.1 in our calculation after rounding.

This is only a problem for values whose second digit after the decimal point is a 5, and the EP_variables of the tract shown above happen not to be affected. But other tracts might have values that are rounded down in our calculation and rounded up in CDC’s, which in turn changes the percentile ranking of a given tract among all tracts.

One option is to add a new rounding function to the package, but for now this is as close as we can get to reproducing the CDC SVI result, which is not too bad.
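As a rough idea of what such a rounding function could look like, the sketch below rounds halves away from zero. The name `round_half_up` and the small tolerance term are assumptions made for illustration; this is not part of the package, and whether it matches CDC’s SQL rounding in every case is untested.

```r
# Hypothetical helper: round halves away from zero instead of relying on round()
round_half_up <- function(x, digits = 0) {
  scaled <- abs(x) * 10^digits
  # The small tolerance lets values stored just below a half (such as 0.15)
  # round up; it also rounds up values within about 1.5e-8 below a half
  scaled <- trunc(scaled + 0.5 + sqrt(.Machine$double.eps))
  sign(x) * scaled / 10^digits
}

round(0.15, 1)            # reported as 0.1 in this post; may vary by platform
round_half_up(0.15, 1)    # 0.2
round_half_up(18.15, 1)   # 18.2
round_half_up(18.14, 1)   # 18.1
```

Swapping this in for round() at the EP_ stage would be one way to test whether the remaining differences disappear.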
