Neighborhood Built Environment Profiles in Bogotá

Finite mixture modeling with ZAT-level data

Author

Heli Xu

Published

February 7, 2024

After visualizing ZAT-level variable distributions spatially on maps and statistically through histograms, we’ll now try using finite mixture modeling to fit our data to understand the neighborhood profiles.

Code
library(dplyr)
library(ggplot2)
library(tidyr)
library(purrr)
library(tibble)
library(flexmix)
library(ggbeeswarm)
library(reactable)
library(crosstalk)

Data preparation

In the previous post, we used the raw data to explore the variable distributions. To get the data ready for modeling, we’ll need to clean and standardize the data.

The cleaning step is straightforward, where we leave out the ZAT units with zero road network length (LRDENS, in meters) and/or intersections. 4 ZAT units don’t have measurable road network length or intersections, and another 9 ZAT units don’t have intersections, which reduce the dataset from 919 rows (ZAT units) to 906 rows.

To standardize the data, we’re using areakm2 to normalize the countinuous and count data that have not already been area-adjusted (eg., NUMRBP, NUMTTFLIGH, NUMSTTREES), and taking a natural log of the length measurements in meters (eg., LRDENS, LONGRBP), with the bike lane rate transformed into log length of bike lanes.

Code
zat_std <- zat_raw_data %>%
  filter(LRDENS > 0, 
         NUMINT >0) %>% 
  mutate(
    road_length_log = log(LRDENS),
    st_4ln_length_log = log(LONGMV),
    bikelane_per_km2 = BPRDRATE * LRDENS,
    bikelane_m_log = case_when(bikelane_per_km2 > 0 ~ log(bikelane_per_km2),
                              .default = bikelane_per_km2),
    sttree_per_km2 = NUMSTTREES / areakm2,
    bridg_per_km2 = NUMBRIDGES / areakm2,
    trlight_per_km2 = NUMTTFLIGH / areakm2,
    numrbp_per_km2 = NUMRBP/areakm2,
    numrt_per_km2 = NUMRT/areakm2,
    longrbp_per_km2 = LONGRBP / areakm2,
    longrt_per_km2 = LONGRT / areakm2,
    bus_length_log = case_when(longrbp_per_km2 > 0 ~ log(longrbp_per_km2),
                               .default = longrbp_per_km2),
    brt_length_log = case_when(longrt_per_km2 > 0 ~ log(longrt_per_km2),
                               .default = longrt_per_km2)
  ) 

The histograms of the distribution of the original variables, standardized variables and their mean/variance can be viewed on this interactive shiny dashboard. For downstream modeling, we’ll only include the standardized variables.

Code
zat_std2 <- zat_std %>% 
  select(ZAT, BUSTOPDENS, road_length_log, st_4ln_length_log, bikelane_m_log, 
         sttree_per_km2, bridg_per_km2, trlight_per_km2, numrbp_per_km2,
         numrt_per_km2, bus_length_log, brt_length_log) 

Correlation among variables

First we’ll explore the correlation among these variables.

Code
zat_cor <- zat_std2 %>% select(-ZAT) %>% cor()

zat_cor[lower.tri(zat_cor, diag = FALSE)] <- NA
# >0.6 related variables

cor_react <- zat_cor %>% as.data.frame() %>% 
  rownames_to_column(var = "var")

cor_react2 <- SharedData$new(cor_react)

reactable(cor_react2, 
    columns = list(
      var = colDef(
        sticky = "left",
      # Add a right border style to visually distinguish the sticky column
        style = list(borderRight = "1px solid #eee"),
        headerStyle = list(borderRight = "1px solid #eee")
    )),
  theme = reactableTheme(color = "#002b36"),
    defaultColDef = colDef(minWidth = 150),
    defaultPageSize = 11,
    striped = TRUE,
    highlight = TRUE,
    bordered = TRUE,
    resizable = TRUE
  )

The variables that show strong correlation are:

Code
as.data.frame(zat_cor) %>%
  rownames_to_column("var1") %>% 
  pivot_longer(cols = -var1, names_to = "var2", values_to = "cor") %>% 
  filter(cor >0.6 & cor != 1) %>% 
  print()
# A tibble: 3 × 3
  var1            var2                cor
  <chr>           <chr>             <dbl>
1 road_length_log st_4ln_length_log 0.617
2 road_length_log bus_length_log    0.635
3 numrt_per_km2   brt_length_log    0.661

The length of toad network is shown to be correlated with both the length of streets with 4 or more lanes and the length of bus routes. Although the number of BRT routes are expectedly correlated with the length of the BRT routes, that’s not the case for bus routes. Based on this, we’ll exclude road_length_log and numrt_per_km2 in our modeling.

Model-based clustering

Since the clean and standardized data are all continuous data, despite some of them being zero-inflated, it would still be reasonable to choose a Gaussian model for clustering. We’ll also make an attempt to categorize the variables into two domains and fit the model for each domain and both domains together.

  • Street design domain: st_4ln_length_log, bikelane_m_log, trlight_per_km2, sttree_per_km2, bridg_per_km2;
  • Transportation domain: BUSTOPDENS, bus_length_log, brt_length_log, numrbp_per_km2.
Code
zat_fmm <- zat_std2 %>% select(-road_length_log, -numrt_per_km2)

street <- zat_fmm %>% 
  select(st_4ln_length_log, bikelane_m_log, trlight_per_km2, sttree_per_km2, bridg_per_km2) %>% 
  colnames()

transportation <- zat_fmm %>% 
  select(BUSTOPDENS, bus_length_log, brt_length_log, numrbp_per_km2) %>% 
  colnames()

all <- zat_fmm %>% 
  select(-ZAT) %>% 
  colnames()

Using the flexmix R package, we’ll fit the model using a range of component numbers, from 1 to 7, and use Bayesian Information Criteria (BIC) values to select the optimal number of clusters (usually at smallest BIC value or when the BIC decrease < 1.5%). For example, the BIC values for all variables (both domains) look like this with k (component number) going from 1 to 7 (in this case k=3 is selected as the optimal number of clusters):

Code
#normal
source("../../../../functions/fmm_normal.R")

street_mix<- fmm_normal(zat_fmm, street, 1:7)
transp_mix <- fmm_normal(zat_fmm,transportation, 1:7)
all_mix <- fmm_normal(zat_fmm, all, 1:7)

source("../../../../functions/elbow_bic.R")
elbow_bic(all_mix, 1:7, title = "Elbow Plot for BIC Values (all variables)")

Note that the modeling result and their BIC values could vary each time we run the stepFlexmix(), because the Expectation-Maximization (EM) algorithm involves random initialization of parameters.

Neighborhood profiles

After determining the cluster numbers, we could visualize the ZAT-level built environment profiles across different variables.

Code
#beeswarm
source("../../../../functions/plot_profile.R")
plot_profile(zat_fmm, street_mix, street, title = "Neighborhood Profiles: Street Desgin")
plot_profile(zat_fmm, transp_mix, transportation, title = "Neighborhood Profiles: Transportation")
plot_profile(zat_fmm, all_mix, all, title = "Neighborhood Profiles: All Indicators")
  • Street design domain:

  • Transportation domain:

  • All variables:

Posterior probabilities

From the fitted models, we can also extract the posterior probabilities of cluster assignment for each ZAT unit. For example, in the model where all the variables are included, the posterior probabilities for each ZAT unit can be visualized in the plot below:

Code
mix_best <- getModel(all_mix, "BIC")

post_p <- posterior(mix_best) %>% as.data.frame() %>% 
  rownames_to_column(var = "ZAT") %>% 
  mutate(order = V1-V3) %>% 
  pivot_longer(c(V1, V2, V3), names_to = "cluster", values_to = "prob")

ggplot(post_p, aes(x = reorder(ZAT, order), y = prob, fill = cluster)) +
  geom_bar(stat = "identity", position = "fill", width = 1) +  # Adjust the width to fill the space
  scale_fill_brewer(palette = "YlGnBu") +
  coord_flip() +  # Rotate the plot to make it look like a heatmap
  scale_x_discrete(expand = c(0, 0)) +  # Remove space between groups
  labs(x = "ZAT Units", y = "Posterior Probabilities", title = "Posterior Probabilities of Cluster Assignment of Each ZAT Unit") +
  theme(axis.text.y = element_blank(),
        axis.line.y = element_blank(),
        title = element_text(face = "bold", size = 13),
        axis.title.x = element_text(size = 12),
        axis.text = element_text(size = 10),
        legend.text = element_text(size = 10))