Code
library(dplyr)
library(ggplot2)
library(tidyr)
library(purrr)
library(tibble)
library(flexmix)
library(ggbeeswarm)
library(reactable)
library(crosstalk)
Finite mixture modeling with ZAT-level data
Heli Xu
February 7, 2024
After visualizing ZAT-level variable distributions spatially on maps and statistically through histograms, we’ll now try using finite mixture modeling to fit our data to understand the neighborhood profiles.
In the previous post, we used the raw data to explore the variable distributions. To get the data ready for modeling, we’ll need to clean and standardize the data.
The cleaning step is straightforward, where we leave out the ZAT units with zero road network length (LRDENS
, in meters) and/or intersections. 4 ZAT units don’t have measurable road network length or intersections, and another 9 ZAT units don’t have intersections, which reduce the dataset from 919 rows (ZAT units) to 906 rows.
To standardize the data, we’re using areakm2
to normalize the countinuous and count data that have not already been area-adjusted (eg., NUMRBP
, NUMTTFLIGH
, NUMSTTREES
), and taking a natural log of the length measurements in meters (eg., LRDENS
, LONGRBP
), with the bike lane rate transformed into log length of bike lanes.
zat_std <- zat_raw_data %>%
filter(LRDENS > 0,
NUMINT >0) %>%
mutate(
road_length_log = log(LRDENS),
st_4ln_length_log = log(LONGMV),
bikelane_per_km2 = BPRDRATE * LRDENS,
bikelane_m_log = case_when(bikelane_per_km2 > 0 ~ log(bikelane_per_km2),
.default = bikelane_per_km2),
sttree_per_km2 = NUMSTTREES / areakm2,
bridg_per_km2 = NUMBRIDGES / areakm2,
trlight_per_km2 = NUMTTFLIGH / areakm2,
numrbp_per_km2 = NUMRBP/areakm2,
numrt_per_km2 = NUMRT/areakm2,
longrbp_per_km2 = LONGRBP / areakm2,
longrt_per_km2 = LONGRT / areakm2,
bus_length_log = case_when(longrbp_per_km2 > 0 ~ log(longrbp_per_km2),
.default = longrbp_per_km2),
brt_length_log = case_when(longrt_per_km2 > 0 ~ log(longrt_per_km2),
.default = longrt_per_km2)
)
The histograms of the distribution of the original variables, standardized variables and their mean/variance can be viewed on this interactive shiny dashboard. For downstream modeling, we’ll only include the standardized variables.
First we’ll explore the correlation among these variables.
zat_cor <- zat_std2 %>% select(-ZAT) %>% cor()
zat_cor[lower.tri(zat_cor, diag = FALSE)] <- NA
# >0.6 related variables
cor_react <- zat_cor %>% as.data.frame() %>%
rownames_to_column(var = "var")
cor_react2 <- SharedData$new(cor_react)
reactable(cor_react2,
columns = list(
var = colDef(
sticky = "left",
# Add a right border style to visually distinguish the sticky column
style = list(borderRight = "1px solid #eee"),
headerStyle = list(borderRight = "1px solid #eee")
)),
theme = reactableTheme(color = "#002b36"),
defaultColDef = colDef(minWidth = 150),
defaultPageSize = 11,
striped = TRUE,
highlight = TRUE,
bordered = TRUE,
resizable = TRUE
)
The variables that show strong correlation are:
# A tibble: 3 × 3
var1 var2 cor
<chr> <chr> <dbl>
1 road_length_log st_4ln_length_log 0.617
2 road_length_log bus_length_log 0.635
3 numrt_per_km2 brt_length_log 0.661
The length of toad network is shown to be correlated with both the length of streets with 4 or more lanes and the length of bus routes. Although the number of BRT routes are expectedly correlated with the length of the BRT routes, that’s not the case for bus routes. Based on this, we’ll exclude road_length_log
and numrt_per_km2
in our modeling.
Since the clean and standardized data are all continuous data, despite some of them being zero-inflated, it would still be reasonable to choose a Gaussian model for clustering. We’ll also make an attempt to categorize the variables into two domains and fit the model for each domain and both domains together.
st_4ln_length_log
, bikelane_m_log
, trlight_per_km2
, sttree_per_km2
, bridg_per_km2
;BUSTOPDENS
, bus_length_log
, brt_length_log
, numrbp_per_km2
.zat_fmm <- zat_std2 %>% select(-road_length_log, -numrt_per_km2)
street <- zat_fmm %>%
select(st_4ln_length_log, bikelane_m_log, trlight_per_km2, sttree_per_km2, bridg_per_km2) %>%
colnames()
transportation <- zat_fmm %>%
select(BUSTOPDENS, bus_length_log, brt_length_log, numrbp_per_km2) %>%
colnames()
all <- zat_fmm %>%
select(-ZAT) %>%
colnames()
Using the flexmix
R package, we’ll fit the model using a range of component numbers, from 1 to 7, and use Bayesian Information Criteria (BIC) values to select the optimal number of clusters (usually at smallest BIC value or when the BIC decrease < 1.5%). For example, the BIC values for all variables (both domains) look like this with k (component number) going from 1 to 7 (in this case k=3 is selected as the optimal number of clusters):
#normal
source("../../../../functions/fmm_normal.R")
street_mix<- fmm_normal(zat_fmm, street, 1:7)
transp_mix <- fmm_normal(zat_fmm,transportation, 1:7)
all_mix <- fmm_normal(zat_fmm, all, 1:7)
source("../../../../functions/elbow_bic.R")
elbow_bic(all_mix, 1:7, title = "Elbow Plot for BIC Values (all variables)")
Note that the modeling result and their BIC values could vary each time we run the stepFlexmix()
, because the Expectation-Maximization (EM) algorithm involves random initialization of parameters.
After determining the cluster numbers, we could visualize the ZAT-level built environment profiles across different variables.
#beeswarm
source("../../../../functions/plot_profile.R")
plot_profile(zat_fmm, street_mix, street, title = "Neighborhood Profiles: Street Desgin")
plot_profile(zat_fmm, transp_mix, transportation, title = "Neighborhood Profiles: Transportation")
plot_profile(zat_fmm, all_mix, all, title = "Neighborhood Profiles: All Indicators")
From the fitted models, we can also extract the posterior probabilities of cluster assignment for each ZAT unit. For example, in the model where all the variables are included, the posterior probabilities for each ZAT unit can be visualized in the plot below:
mix_best <- getModel(all_mix, "BIC")
post_p <- posterior(mix_best) %>% as.data.frame() %>%
rownames_to_column(var = "ZAT") %>%
mutate(order = V1-V3) %>%
pivot_longer(c(V1, V2, V3), names_to = "cluster", values_to = "prob")
ggplot(post_p, aes(x = reorder(ZAT, order), y = prob, fill = cluster)) +
geom_bar(stat = "identity", position = "fill", width = 1) + # Adjust the width to fill the space
scale_fill_brewer(palette = "YlGnBu") +
coord_flip() + # Rotate the plot to make it look like a heatmap
scale_x_discrete(expand = c(0, 0)) + # Remove space between groups
labs(x = "ZAT Units", y = "Posterior Probabilities", title = "Posterior Probabilities of Cluster Assignment of Each ZAT Unit") +
theme(axis.text.y = element_blank(),
axis.line.y = element_blank(),
title = element_text(face = "bold", size = 13),
axis.title.x = element_text(size = 12),
axis.text = element_text(size = 10),
legend.text = element_text(size = 10))