As mentioned in the previous post, we noticed minor differences between our calculated result and the CDC-released SVI, and attributed them to differences in rounding strategy. While CDC keeps one decimal place for EP_ variables (using SQL), our calculation does not round at that stage and therefore keeps more digits after the decimal point. Here, to replicate CDC’s approach, we’ll modify the calculation to round the EP_ variables as well and see how well it reproduces CDC’s result.
Reproducing CDC SVI
We’ll take CT-level data for PA in 2018 as an example, and compare our updated result to CDC SVI.
The good news is that the “wandering” data points from the previous post now stay much closer to the line, with a correlation coefficient of 0.9999995.
Looking at the numeric differences between the two versions of SVI:
For this tract, the differences appear to arise at the percentile ranking stage (all variables are identical). We follow CDC’s calculation description for all percentile rankings, with the same significant digits and ties method, so the most likely culprit is the rounding method.
Caveat of round()
Our calculation sets the number of decimal places using round(), which behaves in a tricky way when the digit being rounded off is a 5. As mentioned in its documentation (paraphrased):
the IEC 60559 standard is expected to be used (“go to the even digit”), but round(0.15, 1) could be either 0.1 or 0.2, depending on the OS services and on representation error.
In our case, round(0.15, 1) returns 0.1, but it appears that CDC’s rounding would return 0.2. For example, if an EP_ variable has a value of 18.15 in our original calculation, it would show up as 18.2 in CDC SVI, whereas it would become 18.1 in our calculation after rounding.
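To see where the representation error comes from, here is a small sketch that prints the two example values with more digits than R shows by default. Neither 0.15 nor 18.15 can be stored exactly in binary floating point, and both end up slightly below the number we typed. The output of round() is printed without an expected value because, as the documentation warns, it can depend on the R version and platform.

```r
# Example values from the text; neither is exactly representable in binary.
x <- c(0.15, 18.15)

# Show the values as actually stored, to 20 decimal places.
stored <- sprintf("%.20f", x)
print(stored)

# The result of round() can vary with the R version and platform.
print(round(x, 1))
```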
This is only a problem for numbers whose second digit after the decimal point is a 5, and the tract shown above happens not to have any such values. But other tracts might have values that are rounded down in our calculation and rounded up in CDC’s calculation, which in turn shifts the percentile ranking of a given tract among all tracts.
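A toy illustration of this knock-on effect, using four made-up tracts. The helper pct_rank() is hypothetical and assumes a percentile rank of (rank - 1) / (N - 1), with ties sharing the lowest rank and four decimal places; the actual settings in our calculation follow CDC’s description and may differ in detail. Tract 2 has an unrounded value of 18.15, which becomes 18.1 in one version and 18.2 in the other.

```r
# Hypothetical helper (assumed formula): (rank - 1) / (N - 1), ties get the
# lowest rank, result rounded to 4 decimal places.
pct_rank <- function(x) {
  round((rank(x, ties.method = "min") - 1) / (length(x) - 1), 4)
}

# Made-up EP_ values for four tracts after rounding to one decimal place.
ep_ours <- c(17.9, 18.1, 18.2, 18.4)  # 18.15 rounded down to 18.1
ep_cdc  <- c(17.9, 18.2, 18.2, 18.4)  # 18.15 rounded up to 18.2

rank_ours <- pct_rank(ep_ours)
rank_cdc  <- pct_rank(ep_cdc)

# Compare the two sets of percentile ranks side by side.
print(data.frame(tract = 1:4, rank_ours, rank_cdc))
```

In this toy case the percentile rank of tract 3 drops from 0.6667 to 0.3333 even though its own value (18.2) is the same in both versions, because tract 2 now ties with it.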
One option is to add a new rounding function to the package, but for now, this is as close as we can get to reproducing the CDC SVI result, which is not too bad.
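For reference, a minimal sketch of what such a function could look like. The name round_half_up() is hypothetical and not part of the package, and the small nudge added before flooring is our assumption for absorbing representation error. A side effect is that values a hair below a true half would also be rounded up.

```r
# Hypothetical "round half up" helper; not part of the package.
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  # Small nudge (assumption) so that values like 18.15, which are stored
  # slightly below the typed number, still count as a half.
  nudge <- sqrt(.Machine$double.eps)
  # Work on the absolute value so negative halves round away from zero.
  sign(x) * floor(abs(x) * scale + 0.5 + nudge) / scale
}

ep <- c(0.15, 18.15, 18.14, -18.15)
print(round_half_up(ep, 1))
# 0.2 18.2 18.1 -18.2
```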