library(dplyr)
library(sf)
library(readr)
library(reactable)
Part 1: Comparing annotators’ reports, AI training data and predictions with GIS data
Heli Xu
April 17, 2024
This post documents the (preliminary) assessment of the reliability of the AI model for built environment (BE) features, by comparing different data sources of BE features at the street level. Here, we test the reliability of the training data for the AI model, the model-based predictions and the annotators’ reports, using GIS data as reference. In another post, we also test the GIS/training/prediction/report data against Computer Assisted Neighborhood Visual Assessment System (CANVAS) data, which is considered a gold standard for BE data collection.
In contrast to the AI testing and validation process, BE features data from different sources are aggregated to the street level for comparison.
The code related to this part (and the next part) of the work is stored in train_predict_gis_comparison.R, and linked here. Alex has Stata code for the GIS-prediction comparison (discussed in the next section), and the R scripts here follow a similar workflow and naming convention.
The raw data file is located at Data/Bogota/Annotations/2024_02_12/annotations.csv. Using the latitude and longitude columns, we can construct point coordinates and join them to the street (calle) level (points to polygons, with st_within). As a side note, 343 data points are unmatched because they fall outside the calle polygons (shown below).
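The point-to-polygon join described above can be sketched as follows. This is a minimal illustration on toy data, not the code from train_predict_gis_comparison.R: the column names lon and lat, the two square "streets" and the CRS (EPSG:4326) are all assumptions made for the example.

```r
library(sf)

# Hypothetical toy points; the real column names in annotations.csv may differ.
pts <- data.frame(
  id  = 1:3,
  lon = c(0.5, 1.5, 5.0),
  lat = c(0.5, 0.5, 5.0)
)

# Helper that builds a unit square starting at x0 (illustrative street polygon).
sq <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Two toy "calle" polygons identified by CodigoCL.
calles <- st_sf(
  CodigoCL = c("CL1", "CL2"),
  geometry = st_sfc(sq(0), sq(1), crs = 4326)
)

# Turn the longitude/latitude columns into point geometries.
pts_sf <- st_as_sf(pts, coords = c("lon", "lat"), crs = 4326)

# Left join: each point gets the CodigoCL of the polygon it lies within,
# or NA if it lies within none.
joined <- st_join(pts_sf, calles, join = st_within)

# Points outside every polygon are the "unmatched" ones.
unmatched <- joined[is.na(joined$CodigoCL), ]
nrow(unmatched)
```

Because st_join keeps all points by default, counting rows with a missing CodigoCL is one way to arrive at a number like the 343 unmatched points reported above.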
After spatially joining the training data to the street level, we can group by the street ID (CodigoCL) and sum up the counts for each street. To distinguish variables for similar features between sources, we add the prefix “tr_” to all the variables from the training data.
The raw data for GIS data is the Calle_datos/ shapefiles, or the STREET_LEVEL.xlsx with updated column names (that match the codebook).
Most of the cleaning consists of renaming the columns for easier understanding (following the style of Alex’s Stata code), with an additional prefix of “gis_”.
Last, we join the training and GIS data by street ID and generate binary variables based on street-level counts (1 if the count is > 0, 0 otherwise). Most of the binary columns have the suffix “_yn”, with the following exceptions (which are also binary variables): gis_sw, gis_bike_lane, gis_any_bus and gis_median. Notably, some binary variables are used to derive additional binary variables. For example, tr_pedxwalk_yn is set to 1 if either tr_sign_crossing_yn or tr_crosswalk_yn is 1.
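The binary-variable step can be sketched like this. The count columns tr_sign_crossing and tr_crosswalk are assumed names for the street-level counts (chosen so that the derived columns match the names in the text), and the values are made up.

```r
library(dplyr)

# Hypothetical street-level counts.
street <- tibble(
  CodigoCL         = c("CL1", "CL2", "CL3"),
  tr_sign_crossing = c(0, 2, 0),
  tr_crosswalk     = c(1, 0, 0)
)

street_bin <- street %>%
  # 1 if the street-level count is > 0, 0 otherwise; new columns get "_yn".
  mutate(across(starts_with("tr_"), ~ as.integer(.x > 0), .names = "{.col}_yn")) %>%
  # Derived binary: 1 if either component binary is 1.
  mutate(tr_pedxwalk_yn = as.integer(tr_sign_crossing_yn == 1 | tr_crosswalk_yn == 1))

street_bin
```

Streets with a missing count would yield NA here rather than 0; whether those should be treated as 0 depends on how the join handles streets absent from one source, which the sketch does not decide.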
Total street count is 8132.
Given that the locations of data points vary between sources, the aggregated street-level counts may not be directly comparable. Here we only compare the binary variables of equivalent/similar features, including traffic signs, traffic lights, crosswalks, stop signs, yield signs, school zone signs, sidewalks, bike lanes, bus lanes, medians, speed bumps, trees, bus stops, parked vehicles, parking lanes, and BRT stations. For quick reference, the summary table of the reliability metrics and the ROC curves are shown below.
Reliability metrics comparing training data vs GIS data
Overall, traffic lights, medians, sidewalks, trees, bus and BRT stops seem to have higher agreement between training and GIS data.
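For readers who want to reproduce this kind of comparison, here is a hedged sketch of how agreement between two binary street-level variables could be computed with GIS as the reference. The choice of metrics (sensitivity, specificity and Cohen's kappa) is an assumption for illustration; agreement_metrics is a hypothetical helper, not a function from the scripts, and the input vectors are made up.

```r
# reference and test are 0/1 vectors of equal length (e.g. a gis_ and a tr_ binary column).
agreement_metrics <- function(reference, test) {
  tp <- sum(test == 1 & reference == 1)  # both sources report the feature
  tn <- sum(test == 0 & reference == 0)  # neither source reports it
  fp <- sum(test == 1 & reference == 0)  # only the tested source reports it
  fn <- sum(test == 0 & reference == 1)  # only the reference reports it
  n  <- tp + tn + fp + fn

  po <- (tp + tn) / n                                          # observed agreement
  pe <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2  # chance agreement

  c(
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    kappa       = (po - pe) / (1 - pe)
  )
}

# Toy example with eight streets.
gis <- c(1, 1, 1, 0, 0, 0, 0, 1)
tr  <- c(1, 1, 0, 0, 0, 1, 0, 1)
agreement_metrics(gis, tr)
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, giving a sensitivity and specificity of 0.75 and a kappa of 0.5.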
The code related to this part (and the previous part) of the work is stored in train_predict_gis_comparison.R, and linked here.
Alex completed joining the prediction and GIS data (with SES level and road types) at the street level (Data/Bogota/Annotations/2024_02_12/predictions_st_mv.dbf), so we’ll use that to perform further processing:
rename the columns: prefix “an_” for prediction data, prefix “gis_” for GIS data.
derive the binary columns.
After removing rows with a missing street ID, the total street count is 7332.
Reliability metrics comparing prediction data vs GIS data
Traffic lights, sidewalks, medians, trees, bus and BRT stops continue to show relatively high agreement between prediction and GIS data. In addition, bike lanes show much higher agreement in the prediction-GIS comparison than in the training-GIS comparison.
In the image below, prediction data is shown in orange, while training data is shown in blue. The slightly darker shade of blue indicates the streets that are present in both training and prediction data aggregated to street level (~1000 streets).
The code related to this part of the work is stored in annotator_gis_comparison.R, and linked here.
This is another set of data from human annotators, stored in .json format (in the Bogota/Annotations/Annotations_2023_04_23/ folder). Each json file contains a list of lists/tables (as shown below for test.json); the most relevant information for us is the coordinates in the image file names (in the red rectangle), the categories of the annotations (underlined in orange: IDs and names) and the actual annotations. We’ll match the category names to the annotations by category id, match the image coordinates to the annotations by image id, combine the three json files (test.json, train.json and eval.json), and use the coordinates to join the point data to the street level.
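The matching steps above can be sketched on small stand-in tables. Everything specific here is an assumption for illustration: the field names (id, image_id, category_id, name, file_name) follow a COCO-like layout, and the file name pattern "lat_lon.jpg" is hypothetical, since the real naming scheme is only visible in the screenshot.

```r
library(dplyr)

# Hypothetical stand-ins for the three tables inside one json file
# (in practice these would come from parsing the file, e.g. with jsonlite::fromJSON).
images <- tibble(
  id        = c(10, 11),
  file_name = c("4.6500_-74.0800.jpg", "4.6600_-74.0900.jpg")
)
categories  <- tibble(id = c(1, 2), name = c("traffic_light", "crosswalk"))
annotations <- tibble(id = 1:3, image_id = c(10, 10, 11), category_id = c(1, 2, 2))

annotated <- annotations %>%
  # Attach the category name by category id.
  left_join(categories, by = c("category_id" = "id")) %>%
  # Attach the image file name by image id.
  left_join(images, by = c("image_id" = "id"))

# Parse coordinates out of the (assumed) "lat_lon.jpg" file name pattern.
parts <- strsplit(sub("\\.jpg$", "", annotated$file_name), "_", fixed = TRUE)
annotated$lat <- as.numeric(vapply(parts, `[`, character(1), 1))
annotated$lon <- as.numeric(vapply(parts, `[`, character(1), 2))

annotated
```

If image or category ids are not unique across test.json, train.json and eval.json, it may be safer to run these joins per file and only then stack the results with bind_rows; the resulting lat/lon columns can then be used for the same point-to-polygon join as for the training data.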
Last, we’ll join GIS data by street IDs and rename/derive the columns.
Rename: prefix “js_” for annotators’ report, “gis_” for GIS data.
Binary variables.
Total street count is 9178.
Reliability metrics comparing annotation data vs GIS data
Consistent with the previous comparisons, traffic lights, sidewalks, medians, bike lanes, trees, bus and BRT stops show better agreement between annotators’ data and GIS data.
In summary, compared with GIS data, several features show fair to substantial agreement across training, prediction and annotators’ data: traffic lights, sidewalks, medians, trees, and bus and BRT stops (and potentially bike lanes).