Built Environment, Pedestrian Injuries, and Deep Learning

In the first look, the street-level variable names in Calles_datos/ seemed a bit confusing, but after cross-referencing the codebook with CRUCES_DATOS.xlsx and the variables in siniestros/, we have a much better understanding of what they mean. The column names in Calles_datos/ and their corresponding descriptions from the codebook are organized into codebook_calles-var.xlsx for future reference (a few uncertain variables still need to be confirmed or better defined).

In this post, we’ll describe the data cleaning process and explore the distributions of the variables related to road infrastructure at the street level.

Data Cleaning

As mentioned in the first look, the data included in Calles_datos/ is a geo-referenced table with 100,819 street units (rows) and 66 attributes (columns, including street id and variables).

If we look at the histograms of total area and roadway area, we see many very small streets with areas close to 0, which may not be very useful for our downstream analysis.

Code

library(dplyr)
library(ggplot2)
library(tidyr)

Code

calle %>% 
  as.data.frame() %>% 
  select(CodigoCL,total_area = area, roadway_area = A_Calzada) %>% 
  pivot_longer(-CodigoCL, names_to = "variables") %>% 
  ggplot(aes(x=value)) +
  geom_histogram(fill = "skyblue", color = "blue", binwidth = 50)+
  theme_minimal()+
  facet_wrap(~variables)+
  labs(title = "Distribution of Street-level Roadway and Total Areas")+
  theme(
    strip.background = element_rect(fill = "#dadada", color = "white"),
    strip.text = element_text(size = 13),
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 15, face = "bold")
  )

To zoom in on the low end of the distribution, we sort by roadway area in ascending order and take the first 500 rows (street units):

Code

calle %>% 
  as.data.frame() %>% 
  select(CodigoCL, total_area = area, roadway_area = A_Calzada) %>% 
  # sort by roadway area so the slice matches the "smallest roadways" subtitle
  arrange(roadway_area) %>% 
  slice(1:500) %>% 
  pivot_longer(-CodigoCL, names_to = "variables") %>% 
  ggplot(aes(x=value)) +
  geom_histogram(fill = "skyblue", color = "blue", binwidth = 0.5)+
  theme_minimal()+
  facet_wrap(~variables)+
  labs(title = "Distribution of Street-level Roadway and Total Areas",
    subtitle = "for 500 street units with the smallest roadways")+
  theme(
    strip.background = element_rect(fill = "#dadada", color = "white"),
    strip.text = element_text(size = 12),
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 15, face = "bold"),
    plot.subtitle = element_text(size = 13)
  )

If we leave out values that are too small (here, roadway areas of 5 m² or less), the 500 street units with the smallest roadways look like this:

Code

calle %>% 
  as.data.frame() %>% 
  select(CodigoCL, total_area = area, roadway_area = A_Calzada) %>% 
  filter(roadway_area > 5) %>% 
  # sort by roadway area so the slice matches the "smallest roadways" subtitle
  arrange(roadway_area) %>% 
  slice(1:500) %>% 
  pivot_longer(-CodigoCL, names_to = "variables") %>% 
  ggplot(aes(x=value)) +
  geom_histogram(fill = "skyblue", color = "blue", binwidth = 0.5)+
  theme_minimal()+
  facet_wrap(~variables)+
  labs(title = "Distribution of Street-level Roadway and Total Areas",
    subtitle = expression("for 500 street units with the smallest roadways >5"* m^2))+
  theme(
    strip.background = element_rect(fill = "#dadada", color = "white"),
    strip.text = element_text(size = 12),
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 15, face = "bold"),
    plot.subtitle = element_text(size = 13)
  )
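The 5 m² cutoff is a judgment call, so it can help to see how many street units different cutoffs would drop. The snippet below is a minimal sketch on made-up values, not the real Calles_datos/ table; the `toy` table, the `cutoffs` vector, and the `dropped` summary are hypothetical names used only for illustration.

```r
library(dplyr)

# Made-up roadway areas in m^2; A_Calzada mirrors the column name used in the post
toy <- tibble(A_Calzada = c(0, 0.2, 1.5, 4.9, 5, 12, 80, 350))

# Candidate cutoffs to compare (5 is the value chosen in this post)
cutoffs <- c(1, 5, 10)

# For each cutoff k, count the rows that `filter(A_Calzada > k)` would remove
dropped <- tibble(
  cutoff    = cutoffs,
  n_dropped = sapply(cutoffs, function(k) sum(toy$A_Calzada <= k))
)

dropped
```

Running the same tabulation on the full table would show whether the row count is sensitive to the exact cutoff.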

After applying the 5 m² filter to the full table, the dataset shrinks from 100,819 rows to 99,560. We add one more cleaning step: for each row, we sum all numeric columns (except the id, area-related, and speed-limit columns) and drop rows whose sum is less than 1. Because most of the variables are counts of specific features, this removes street units that have almost none of the features we are studying. The resulting table has 99,533 rows.

In addition, we remove columns that contain arbitrary IDs, repeated labels, summary columns (sums of other columns), and a few uncertain variables. For the character column sent_vial, we recode “uno” and “doble” as the numeric values 1 and 2, and “SinD*” (no assigned direction) as 0.

Code

calle_clean <- calle %>%
  filter(A_Calzada > 5) %>%
  select(
    -c(CODIGO_IDE,
      FID_EP_IND,
      Etiquetas,
      Etiqueta_1,
      sen_v_inv,
      comp_cl,
      OID_,
      Total_gene,
      Total_ge_1)
  ) %>%
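  # Sum the numeric feature columns per row, skipping id, area, speed-limit, and bridge columns
  mutate(total = rowSums(pick(
    where(is.numeric),
    -c(sent_vial,
      puente_vh,
      Puente_PT,
      velcidad,
      area,
      A_Calzada,
      CodigoCL)
  ))) %>% # 99,560 rows at this point
  # rowSums() over pick() is much faster than rowwise(); see dplyr's row-wise operations vignette
  filter(total >= 1) %>% # 99,533 rows remain
  select(-total) %>%
  # Recode sent_vial: "uno" -> 1, "doble" -> 2, "SinD*" (no assigned direction) -> 0
  mutate(sent_vial = case_match(sent_vial, "uno" ~ 1, "doble" ~ 2, "SinD*" ~ 0))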
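To make the three cleaning steps easier to follow, here is a minimal sketch of the same logic on a made-up five-row table. The values are invented, only a handful of columns are included, and `toy`, `toy_clean`, and `toy_clean_df` are hypothetical names. It assumes dplyr 1.1.0 or later, which provides `pick()` and `case_match()`.

```r
library(dplyr)

# Made-up street units; column names mirror those used in the post
toy <- tibble(
  CodigoCL  = 1:5,
  A_Calzada = c(2, 40, 60, 75, 90),   # roadway area in m^2
  area      = c(3, 55, 80, 95, 120),  # total area in m^2
  velcidad  = c(30, 30, 50, 30, 60),  # speed limit
  arboles   = c(4, 0, 3, 1, 0),       # feature count
  semaforo  = c(1, 0, 2, 0, 1),       # feature count
  sent_vial = c("uno", "doble", "SinD*", "uno", "doble")
)

toy_clean <- toy %>%
  # Step 1: drop tiny roadways (removes unit 1)
  filter(A_Calzada > 5) %>%
  # Step 2: sum the feature counts per row, excluding id, area, and speed limit
  mutate(total = rowSums(pick(
    where(is.numeric),
    -c(velcidad, area, A_Calzada, CodigoCL)
  ))) %>%
  filter(total >= 1) %>%  # removes unit 2, which has no features at all
  select(-total) %>%
  # Step 3: recode the direction labels as numbers
  mutate(sent_vial = case_match(sent_vial, "uno" ~ 1, "doble" ~ 2, "SinD*" ~ 0))

# Plain data frame copy, analogous to the as.data.frame() calls in earlier chunks
toy_clean_df <- as.data.frame(toy_clean)
```

The plotting chunks later in this post read from calle_clean_df, which is not defined in the code shown here. Presumably it is calle_clean converted to a plain data frame in the same way as the last line of the sketch, but that is an assumption.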

Variables about Road Features

For better visualization, we’re (again, loosely) grouping the variables by the domains commonly used for road infrastructure features. Below are the domains and the variables in each:

Road Geometry: AVE_pendie, A_Calzada, P_Ancho_Cl, A_separado, A_andenes, sum_carril;
Signage and Markings: sen_horizo, se_hor_seg, sen_vert, semaforo, X1_girar_iz, X5policiasa, X6pare, X7estaciona, X8zonas_esc, X9ceder_el;
Traffic Flow and Patterns: segme_via, velcidad, sent_vial;
Pedestrian Infrastructure: Puente_PT, A_andenes, X4peatonale;
Cycling Infrastructure: largo_cicl, X2_ciclov.;
Public Transportation: Rutas_TRM, Rutas_SITP, Parad_SITP, Caril_SITP, X3bus_o_Tra;
Landscaping: arboles.

Code

road_geo <- c("AVE_pendie", "A_Calzada", "P_Ancho_Cl",
  "A_separado", "A_andenes","sum_carril") 
#not including "av_carrile"

sign1 <- c("sen_horizo","se_hor_seg","sen_vert", "semaforo")
sign2 <- c("X1_girar_iz", "X5policiasa", "X6pare", 
  "X7estaciona", "X8zonas_esc", "X9ceder_el")

flow <- c("segme_via", "velcidad", "sent_vial")

ped <- c("A_andenes", "X4peatonale")

bike <- c("largo_cicl","X2_ciclov.")

transit <- c("Rutas_TRM","Rutas_SITP", "Parad_SITP",  
  "Caril_SITP","X3bus_o_Tra")

landsc <- c("arboles")
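The per-domain plots in this post all share the same pipeline: select a group of columns, pivot to long format, and draw faceted histograms. One way to avoid repeating that code is a small wrapper function. The sketch below is only an illustration; `plot_domain_hist()` is a hypothetical helper that does not appear in the original analysis, and it is demonstrated on made-up data.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical helper: faceted histograms for one group of variables.
# `...` is passed on to geom_histogram(), e.g. binwidth = 1.
plot_domain_hist <- function(data, vars, title, subtitle = NULL, ...) {
  data %>%
    select(CodigoCL, all_of(vars)) %>%
    pivot_longer(-CodigoCL, names_to = "variables") %>%
    ggplot(aes(x = value)) +
    geom_histogram(fill = "skyblue", color = "blue", ...) +
    theme_minimal() +
    facet_wrap(~variables, scales = "free") +
    labs(title = title, subtitle = subtitle) +
    theme(
      strip.background = element_rect(fill = "#dadada", color = "white"),
      strip.text = element_text(size = 12),
      panel.grid.minor = element_blank(),
      plot.title = element_text(size = 15, face = "bold")
    )
}

# Made-up data standing in for the cleaned street table
set.seed(42)
toy_df <- data.frame(
  CodigoCL = 1:20,
  arboles  = rpois(20, 3),
  semaforo = rpois(20, 1)
)

p <- plot_domain_hist(toy_df, c("arboles", "semaforo"),
                      title = "Toy example", binwidth = 1)
```

With a helper like this, a call such as `plot_domain_hist(calle_clean_df, road_geo, "Distribution of Street-level Road Geometry Features")` would stand in for a full plotting chunk.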

Below are the distributions of the variables in each domain:

Road Geometry

Code

calle_clean_df %>% 
  select(CodigoCL, all_of(road_geo)) %>%
  pivot_longer(-CodigoCL, names_to = "variables") %>% 
  ggplot(aes(x=value)) +
  geom_histogram(fill = "skyblue", color = "blue")+
  theme_minimal()+
  facet_wrap(~variables, scales = "free")+
  labs(title = "Distribution of Street-level Road Geometry Features")+
  theme(
    strip.background = element_rect(fill = "#dadada", color = "white"),
    strip.text = element_text(size = 12),
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 15, face = "bold")
  )

Signage and Markings

Code

calle_clean_df %>% 
  select(CodigoCL, all_of(sign1)) %>%
  pivot_longer(-CodigoCL, names_to = "variables") %>% 
  ggplot(aes(x=value)) +
  geom_histogram(fill = "skyblue", color = "blue")+
  theme_minimal()+
  facet_wrap(~variables, scales = "free")+
  labs(title = "Distribution of Street-level Signage Features",
    subtitle = "Part 1")+
  theme(
    strip.background = element_rect(fill = "#dadada", color = "white"),
    strip.text = element_text(size = 12),
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 15, face = "bold")
  )

Code

calle_clean_df %>% 
  select(CodigoCL, all_of(sign2)) %>%
  pivot_longer(-CodigoCL, names_to = "variables") %>% 
  ggplot(aes(x=value)) +
  geom_histogram(fill = "skyblue", color = "blue")+
  theme_minimal()+
  facet_wrap(~variables, scales = "free")+
  labs(title = "Distribution of Street-level Signage Features",
    subtitle = "Part 2")+
  theme(
    strip.background = element_rect(fill = "#dadada", color = "white"),
    strip.text = element_text(size = 12),
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 15, face = "bold")
  )

Traffic Flow and Patterns

Code

calle_clean_df %>% 
  select(CodigoCL, all_of(flow)) %>%
  pivot_longer(-CodigoCL, names_to = "variables") %>% 
  ggplot(aes(x=value)) +
  geom_histogram(fill = "skyblue", color = "blue", binwidth = 1)+
  theme_minimal()+
  facet_wrap(~variables, scales = "free")+
  labs(title = "Distribution of Street-level Features on Traffic Flow")+
  theme(
    strip.background = element_rect(fill = "#dadada", color = "white"),
    strip.text = element_text(size = 12),
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 15, face = "bold")
  )

Pedestrian and Cycling Infrastructure

Code

calle_clean_df %>% 
  select(CodigoCL, all_of(ped), all_of(bike)) %>%
  pivot_longer(-CodigoCL, names_to = "variables") %>% 
  ggplot(aes(x=value)) +
  geom_histogram(fill = "skyblue", color = "blue")+
  theme_minimal()+
  facet_wrap(~variables, scales = "free")+
  labs(title = "Distribution of Street-level Pedestrian and Cycling Features")+
  theme(
    strip.background = element_rect(fill = "#dadada", color = "white"),
    strip.text = element_text(size = 12),
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 15, face = "bold"))

Landscaping

Code

calle_clean_df %>% 
  select(CodigoCL, all_of(landsc)) %>%
  pivot_longer(-CodigoCL, names_to = "variables") %>% 
  ggplot(aes(x=value)) +
  geom_histogram(fill = "skyblue", color = "blue")+
  theme_minimal()+
  facet_wrap(~variables, scales = "free")+
  labs(title = "Distribution of Street-level Landscaping Features")+
  theme(
    strip.background = element_rect(fill = "#dadada", color = "white"),
    strip.text = element_text(size = 11),
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 12, face = "bold", hjust = 0))

Multicollinearity between Variables

Code

cor_mx <- calle_clean_df %>% 
  select(-CodigoCL, -area, -Puente_PT, -puente_vh) %>% 
  cor(.)

cor_mx[lower.tri(cor_mx, diag = TRUE)] <- NA
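Setting the lower triangle and diagonal to NA keeps each variable pair exactly once. One possible next step, shown here only as a sketch on made-up data, is to reshape the matrix to long format and rank the pairs by absolute correlation. The names `toy`, `cor_toy`, and `cor_pairs` are hypothetical, and only base R is used.

```r
# Made-up data: b is built from a, c is independent noise
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a * 2 + rnorm(50, sd = 0.1)
toy$c <- rnorm(50)

cor_toy <- cor(toy)
cor_toy[lower.tri(cor_toy, diag = TRUE)] <- NA  # keep each pair once

# Reshape to long format: one row per (row variable, column variable) cell
cor_pairs <- as.data.frame(as.table(cor_toy), stringsAsFactors = FALSE)
names(cor_pairs) <- c("var1", "var2", "r")

# Drop the masked cells, then sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[!is.na(cor_pairs$r), ]
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]

cor_pairs
```

In the toy table the a/b pair comes out on top, since b was constructed from a.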