Code
library(dplyr)
library(sf)
library(ClustGeo)
library(ggplot2)
library(tidyr)
library(patchwork)
with ZAT indicators in Bogotá
Heli Xu
March 14, 2024
Previously we have used finite mixture modeling to cluster the neighborhoods (ZATs) across all the ZAT-level indicators on road infrastructures. In this post, we’ll explore another clustering approach, hierarchical clustering, taking into account the spatial relationships of the ZAT units (in addition to the indicators).
First, we’ll look at the hierarchical clustering output with the indicators only. We can use a dissimilarity matrix D0
calculated with the indicators.
zat_std2 <- readRDS("../../../../clean_data/ZAT/zat_std2.rds")
D0 <- dist(zat_std2)
tree <- hclustgeo(D0)
plot(tree, hang = -1, label = FALSE,
xlab = "", sub = "",
main = "Ward dendrogram with D0 only")
rect.hclust(tree, k = 4, border = 1:4)
legend("topright", legend = paste("cluster", 1:4),
fill = 1:4, bty = "n", border = "white")
We can use this dendrogram to illustrate the arrangement of the clusters produced by hierarchical clustering, and the height at which any two clusters are joined together represents the distance or dissimilarity between those clusters. Depending on the linkage method, the dissimilarity is measured differently:
For this post, we’ll be using Ward’s Method. And from the dendrogram, we are choosing k = 4 as the component number, as shown above in the different colored rectangles. And the resulting partition looks like this:
pal <- c("#225ea8","#41b6c4","#a1dab4","#fecb3e")
source("../../../../functions/cluster_plot.R")
map <- ggplot()+
geom_sf(data = hclust_geo, fill = pal[hclust_geo$clus])
(cluster_plot(zat_hclust) | map ) +
plot_annotation('Hierarchical Clustering with Indicators only',
subtitle = 'ZAT level, Bogotá',
theme=theme(plot.title=element_text(size=14, face = "bold", hjust=0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5))) +
plot_layout(width = c(1, 1.5), height = unit(15, "cm"))
The first way to include the spatial constraint is through the geographic distances.
Distance here is calculated between polygons by st_distances
, which is determined by the minimum distance between any two points on the boundary of non-overlapping polygons. For overlapping polygons, the distance is 0.
We can do that by including another matrix D1
with the information of the distances between ZAT units, and setting a mixing parameter alpha
in [0, 1] to set the importance of D0
and D1
.
When alpha
=0 the D1
is not taken into account, and when alpha
=1 D0
is not taken into account. To choose an appropriate alpha
, we can plot the homogeneity (proportion of explained inertia) of each matrix (Q0
and Q1
) using a range of different values of alpha
and a given number of cluster (in this case k=4).
We are choosing alpha
=0.35 here where Q0
and Q1
both have relatively high homogeneity. Then we can perform the clustering with geographic distances constraint:
As a result, we have a slightly different clustering output:
map2 <- ggplot()+
geom_sf(data = clustgeo_geo, fill = pal[clustgeo_geo$clus])
(cluster_plot(zat_clustgeo) | map2 ) +
plot_annotation(
'Hierarchical Clustering with Indicators and Spatial Constraint',
subtitle = 'ZAT level, Bogotá',
theme=theme(
plot.title=element_text(size=14, face = "bold", hjust=0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5)))+
plot_layout(width = c(1, 1.5), height = unit(15, "cm"))
Another way to consider spatial constraint is through the neighboring ZAT units. We can do that by generating another dissimilarity matrix D1
that reflects whether two ZAT units have contiguous boundaries (sharing one or more boundary point).
We can use a similar strategy to determine the mixing parameter alpha
.
In this case, alpha
= 0.2 is chosen for the clustering process.
Here is how the results look like:
map3 <- ggplot()+
geom_sf(data = clustnb_geo, fill = pal[clustnb_geo$clus])
(cluster_plot(zat_clustnb) | map3 ) +
plot_annotation(
'Hierarchical Clustering with Indicators and Neighborhood Constraint',
subtitle = 'ZAT level, Bogotá',
theme=theme(plot.title=element_text(size=14, face = "bold", hjust=0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5))) +
plot_layout(width = c(1, 1.5), height = unit(15, "cm"))
In contrast to geographical distances, neighborhood distances take into account the network between the ZAT units and better reflects the actual distances between neighborhoods (especially when there’s large water body or mountain separating some ZAT units – e.g. Boston).
We can first generate the neighborhoods (similar to previous section on neighboring status), and the neighborhood distances we need are the line(s) connecting the neighborhoods, as shown below (queen contiguity):
Next, we’ll construct a data frame for the network with columns representing the origin (from), destination (to) and the corresponding distances to create a matrix of distances by the network for the clustering process (D1
).
Similarly, we need to determine a mixing parameter alpha
by the homogeneity plot:
library(cppRouting)
library(data.table)
# data frame
n = length(dist_nb)
res = data.table(from = integer(), to = integer(), dist = numeric())
for(i in seq_len(n)){
res = rbind(res, data.table(from = i, to = nb[[i]], dist = dist_nb[[i]]))
}
graph <- makegraph(res, directed = F)
# distance matrix
dist_link <- get_distance_matrix(Graph=graph,
from = unique(res$from),
to = unique(res$to))
# clustering
D1 <- as.dist(dist_link)
We’ll choose alpha
= 0.45 in this case for clustering.
As a result, we have the following neighborhood profiles:
map4 <- ggplot()+
geom_sf(data = clustnbdist_geo, fill = pal[clustnbdist_geo$clus])
(cluster_plot(zat_clustnbdist) | map4 ) +
plot_annotation('Hierarchical Clustering with Indicators and Neighborhood Distances Constraint',
subtitle = 'ZAT level, Bogotá',
theme=theme(plot.title=element_text(size=13, face = "bold", hjust=0.5),
plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5)))+
plot_layout(widths = c(1,1.5), heights = unit(15, units = "cm"))