Built Environment, Pedestrian Injuries, and Deep Learning

In this report, my goal is to get familiar with the variables (indicators), what they represent according to the codebook, how complete the datasets are, and, where possible, how the variables are distributed.

ZAT level

At the ZAT level, two data sources are included, along with the codebook:

ZAT.zip: boundary files – a table of 1141 rows, 6 cols (each row being a ZAT unit);
Intersections/ZAT_INDICADORES.xlsx: a table of 919 rows, 18 cols (each row being a ZAT unit).
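
The two tables differ in row count (1141 vs. 919), so it may be worth checking which ZAT units have a boundary but no indicators. The following is a minimal sketch on toy data, not the actual check run for this report: `zat_toy` and `ind_toy` are hypothetical stand-ins, and it assumes that the `ZAT` column identifies the same units, with the same type, in both tables.

```r
library(dplyr)

# Toy stand-ins for the boundary table and the indicator table (made-up values)
zat_toy <- tibble(ZAT = c(1, 2, 3, 4, 5))
ind_toy <- tibble(ZAT = c(1, 2, 4), NUMINT = c(10, 0, 7))

# ZAT units that have a boundary but no indicator row
no_indicators <- anti_join(zat_toy, ind_toy, by = "ZAT")

# Indicator rows whose ZAT has no boundary (ideally empty)
no_boundary <- anti_join(ind_toy, zat_toy, by = "ZAT")

nrow(no_indicators)
nrow(no_boundary)
```

With the real data, the same two `anti_join()` calls could be applied to the boundary table (after dropping its geometry with `st_drop_geometry()`) and the indicator table.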

ZAT folder

To begin with, I looked at the ZAT folder, which contains the shapefiles that define the boundary of each ZAT unit.

Code

library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
library(sf)
library(leaflet)
library(reactable)
library(crosstalk)

zat <- st_read("../../../../data/ZAT/ZAT_geo/ZAT.shp")

Reading layer `ZAT' from data source 
  `D:\LocalGitHub\BEPIDL\data\ZAT\ZAT_geo\ZAT.shp' using driver `ESRI Shapefile'
Simple feature collection with 1141 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: -74.88439 ymin: 3.70344 xmax: -73.05211 ymax: 5.83076
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  MAGNA-SIRGAS

Code

zat

Simple feature collection with 1141 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: -74.88439 ymin: 3.70344 xmax: -73.05211 ymax: 5.83076
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  MAGNA-SIRGAS
First 10 features:
         Area MUNCod NOMMun ZAT UTAM                       geometry
1  1366816039      0   <NA> 810  N/A MULTIPOLYGON Z (((-74.08207...
2   824419773      0   <NA> 809  N/A MULTIPOLYGON Z (((-74.05947...
3  1823077343      0   <NA> 823  N/A MULTIPOLYGON Z (((-73.64757...
4   929997085      0   <NA> 822  N/A MULTIPOLYGON Z (((-73.54534...
5  1409253432      0   <NA> 821  N/A MULTIPOLYGON Z (((-73.59517...
6  1085719093  11001 Bogotá 796 UPR3 MULTIPOLYGON Z (((-74.08374...
7  1820638886      0   <NA> 800  N/A MULTIPOLYGON Z (((-74.30147...
8  1503956580      0   <NA> 811  N/A MULTIPOLYGON Z (((-74.49689...
9  2349669218      0   <NA> 820  N/A MULTIPOLYGON Z (((-73.99614...
10 1322014504      0   <NA> 812  N/A MULTIPOLYGON Z (((-74.4526 ...

Code

zat %>%
  st_zm() %>% 
  leaflet() %>%
  addTiles() %>%
  leaflet::addPolygons()

ZAT_INDICADORES

Next, I looked at the ZAT_INDICADORES.xlsx table. Checking against the codebook, the indicators in this table appear to be the “old” version. For the ZAT units included, no values are missing, although some values are zero. The distributions of all indicators in the table are shown here, loosely and tentatively grouped into three categories: public transit, road features (on and by the road), and traffic lights.
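
The completeness statement above can be checked with a per-indicator count of missing and zero values. This is a hedged sketch on a made-up table shaped like the indicator table. The values are invented, and one `NA` is included only to show that the missing-value count works.

```r
library(dplyr)
library(tidyr)

# Toy wide table shaped like the indicator table (values are made up)
toy <- tibble(
  ZAT    = 1:4,
  NUMINT = c(3, 0, 5, 0),
  NUMRT  = c(0, 0, 0, 2),
  LONGMV = c(1.5, 2.0, NA, 0.7)
)

completeness <- toy %>%
  pivot_longer(-ZAT, names_to = "indicator") %>%
  group_by(indicator) %>%
  summarise(
    n_missing  = sum(is.na(value)),              # rows with no value
    n_zero     = sum(value == 0, na.rm = TRUE),  # rows recorded as zero
    share_zero = mean(value == 0, na.rm = TRUE)  # zeros among non-missing rows
  )

completeness
```

Separating `n_missing` from `n_zero` matters here because a zero may be a genuine measurement (no bus stops in the unit) rather than absent data.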

Public transit

Code

zat_data <- read_xlsx("../../../../data/ZAT/ZAT_INDICADORES.xlsx")

zat_data_long <- zat_data %>% 
  select(-2, -3) %>% 
  pivot_longer(-ZAT, names_to = "indicator")

bus <- c('BUSTOPDENS', 'NUMRBP', 'LONGRBP', 'NUMRT', 'LONGRT')
onroad <- c('LONGMV', 'LRDENS', 'BPRDRATE', 'NUMINT', 'INTDENS')
tlight <- c('NUMTTFLIGH', 'NUMPTFLIGH')
byroad <- c('NUMTTREES', 'NUMSTTREES', 'NUMBRIDGES')
  
zat_data_long %>% 
  # all_of() is a tidyselect helper for column selection; plain %in% is enough here
  filter(indicator %in% bus) %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "blue") +
  facet_wrap(~indicator, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Indicators related to Bus/BRT")

Road features

Code

zat_data_long %>% 
  # plain %in% replaces all_of(), which is meant for column selection
  filter(indicator %in% onroad) %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "blue") +
  facet_wrap(~indicator, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Indicators related to Features on the Road")

Code

zat_data_long %>% 
  # plain %in% replaces all_of(), which is meant for column selection
  filter(indicator %in% byroad) %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "blue", alpha = 0.7) +
  facet_wrap(~indicator, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Indicators related to Features by the Road")

Traffic lights

Code

zat_data_long %>% 
  # plain %in% replaces all_of(), which is meant for column selection
  filter(indicator %in% tlight) %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "blue", alpha = 0.7) +
  facet_wrap(~indicator, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Indicators about Traffic Lights")

Street level

So far, the only data folder included at the street level is:

Calles_datos/

This is a geo-referenced data table containing qualitative and quantitative data at the street level. Because the table is large, we take only the first 100 rows to look at the geometry:

Code

calles_100 <- readRDS("../../../../clean_data/calles/calles_100.rds")

calles_100 %>% 
  leaflet() %>% 
  addTiles() %>% 
  leaflet::addPolygons()

Apart from the geometry, the table includes many variables, and their names do not seem to appear in the codebook. Skimming the full table:

Code

skim_calles <- read_csv("../../../../clean_data/calles/skim_calles.csv") %>% 
  select(-1) %>% 
  rename(col_type = skim_type,
         col_names = skim_variable) %>% 
  relocate(col_names)
         
calles <- SharedData$new(skim_calles)

reactable(
  calles,
  columns = list(
    col_names = colDef(
      sticky = "left",
      # Add a right border style to visually distinguish the sticky column
      style = list(borderRight = "1px solid #eee"),
      headerStyle = list(borderRight = "1px solid #eee")
    )
  ),
  theme = reactableTheme(color = "#002b36"),
  defaultColDef = colDef(minWidth = 150),
  defaultPageSize = 12,
  striped = TRUE,
  highlight = TRUE,
  bordered = TRUE,
  resizable = TRUE
)
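
The earlier observation that the street-level variable names do not seem to be in the codebook could be made concrete with a set difference. The sketch below uses hypothetical names throughout: `table_vars` stands in for the column names of the street table and `codebook_vars` for the variable names listed in the codebook, neither of which is reproduced here.

```r
# Hypothetical column names and codebook entries, for illustration only
table_vars    <- c("ZAT", "NUMINT", "var_a", "var_b")
codebook_vars <- c("ZAT", "NUMINT", "LONGMV")

# Columns present in the table but not documented in the codebook
undocumented <- setdiff(table_vars, codebook_vars)

# Codebook entries with no matching column in the table
unused <- setdiff(codebook_vars, table_vars)

undocumented
unused
```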

The initial skim report above shows the type of each of the 67 columns, plus the mean, standard deviation, and distribution for the numeric columns. Although the completion rate of the columns seems high, some columns consist mostly of zeros, with a 75th percentile of 0.
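
One way to list the mostly-zero columns is to compute the 75th percentile and the share of zeros per numeric column. This is a minimal base R sketch on a made-up data frame. The column names `a`, `b`, and `c` are hypothetical, and the real street table would first need its geometry and non-numeric columns dropped.

```r
# Toy numeric table (made-up values)
toy <- data.frame(
  a = c(0, 0, 0, 0, 0, 0, 0, 9),
  b = c(1, 2, 3, 4, 5, 6, 7, 8),
  c = c(0, 0, 0, 0, 1, 2, 3, 4)
)

# 75th percentile of each column (default quantile type 7)
p75 <- sapply(toy, quantile, probs = 0.75, na.rm = TRUE, names = FALSE)

# Share of zero values in each column
share_zero <- sapply(toy, function(x) mean(x == 0, na.rm = TRUE))

# Columns where at least three quarters of the values are zero
mostly_zero <- names(p75)[p75 == 0]
mostly_zero
```

Columns flagged this way may be better treated as presence/absence indicators than as continuous variables, although that choice depends on what each variable represents.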