SVI calculation validation: why are they not identical?
CT- and CTY- level comparison for PA in 2018 and 2020, respectively
Author
Heli Xu
Published
February 22, 2023
In the SVI calculation validation process, while we we see very strong correlations between our calculated result and CDC-released SVI of the same year at the county and census tract level (detailed in previous post), we do observe minor differences. Here, we explore the reasons why our calculation is not identical with CDC SVI.
For example, here is a scatter plot showing the correlation of the two versions of CT-level SVI for PA in 2018 (the tract with largest difference between CDC and calculated RPL highlighted in red):
Code
library(tidyverse)library(patchwork)library(knitr)result_ct_pa2018 <-readRDS("../../../cdc_us_svi/result/pa_tract_result2018.rds")svi_pa_2018 <-read_csv("../../../cdc_us_svi/cdc_svi_2018_pa_ct.csv") %>%rename(GEOID = FIPS)#bad_tract = '42071010900'#make a function for joining CDC SVI and our resultsjoin_table <-function(cdc, diy){ cdc %>%select( GEOID,cdc_RPL_themes = RPL_THEMES,cdc_RPL_theme1 = RPL_THEME1,cdc_RPL_theme2 = RPL_THEME2,cdc_RPL_theme3 = RPL_THEME3,cdc_RPL_theme4 = RPL_THEME4 ) %>%mutate(GEOID =paste(GEOID)) %>%left_join( diy %>%select( GEOID, RPL_themes, RPL_theme1, RPL_theme2, RPL_theme3, RPL_theme4 ) ) }ct_check18 <-join_table(svi_pa_2018, result_ct_pa2018)ct_check18 %>%drop_na() %>%filter_all(all_vars(.>=0)) %>%ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +geom_point(color ="#004C54")+geom_point(data = ct_check18 %>%filter(GEOID =="42071010900"),aes(x = cdc_RPL_themes, y = RPL_themes),color ='red')+geom_abline(slope =1, intercept =0)+labs(title ="CDC vs. calculated CT-level SVI for PA in 2018",subtitle ="Comparison of overall percentile ranking (RPLs)",y ="calculated overall RPL",x ="CDC overall RPL")+theme(plot.title =element_text(size=15))
There are clearly some dots wandering (a bit) away from the line. If we’re retrieving data at the same geographic level and following the same calculation procedure as CDC does, why would there be differences at all?
Minor differences at CT level
Using the same example as the plot above, we’ll specifically look at the difference (absolute value) between the two versions of overall RPLs for all tracts (GEOIDs). Arranging the difference in descending order, we can get a glance at the tracts with relatively high discrepancy between the RPLs from CDC SVI and ours (first 15 rows are shown below).
Zooming in on the GEOID with the largest difference between the two versions of RPLs, 42071010900, we could extract information on individual variable from our calculated SVI and CDC SVI. (In the table below, calculated SVI is denoted as “hSVI”, and CDC SVI is denoted as “cSVI;”diff” column contains the difference between hSVI and cSVI and arranged in descending order.)
From the side-by-side comparison, the minor discrepancy in RPLs (and SPLs, EPLs) is most likely due to different number of decimal places in some EP_variables. While cSVI is using one decimal places for the EP_variables (when calculation is required, that is, when the percentage information cannot be retrieved directly from census data), hSVI does not specify decimal places and therefore shows more digits after decimal point.
Minor differences at CTY level
Additionally, we could take a closer look at the minor difference at the county level, using the SVIs for PA in 2020 this time. Below shows a correlation scatter plot of overall RPLs between CDC and calculated version, with the most different data points (tied) in red.
Code
result2020_co <-readRDS("../../../cdc_us_svi/result/pa_co_result2020.rds")svi_pa_2020co <-read_csv("../../../download/2020svi_pa_co_cdc.csv") %>%rename(GEOID = FIPS)co_check20 <-join_table(svi_pa_2020co, result2020_co)co_check20 %>%drop_na() %>%filter_all(all_vars(.>=0)) %>%ggplot(aes(x = cdc_RPL_themes, y = RPL_themes)) +geom_point(color ="#191970")+geom_point(data = co_check20 %>%filter(GEOID%in%c("42067", "42061")),aes(x = cdc_RPL_themes, y = RPL_themes),color ='red')+geom_abline(slope =1, intercept =0)+labs(title ="CDC vs. calculated CTY-level SVI for PA in 2020",subtitle ="Comparison of overall percentile ranking (RPLs)",y ="calculated overall RPL",x ="CDC overall RPL")+theme(plot.title =element_text(size=15))
Also, the top 15 counties that have the largest difference between CDC and our calculated RPLs are included below:
Taking county GEOID 42061 as an example, we could compare the values for all variables from SVI (denotation same as above; “diff” column is arranged in descending order):
Similarly, the difference in decimal places in EP_variables seems to be the major contributor for the discrepancy in downstream percentile rankings, especially for the variables in theme 2 and 4.
“Caveat” in CDC SVI documentation
In fact, CDC SVI documentation (before 2018) also includes a section called “Reproducibility Caveat” where they mention “results may differ slightly when replicating SVI using Microsoft Excel or similar software” due to “variation in the number of decimal places”, as CDC uses SQL for their SVI development. In 2020, this section was removed (and a different section of “Caveat for SVI State Databases” was added), but it seems that the calculation strategy in terms of decimal places is still the same. We could consider adjusting our calculation to match CDC’s strategy in the future.