In this report, my goal is to get familiar with the variables (indicators): what they represent according to the codebook, how complete the datasets are, and, where useful, how the variables are distributed.
ZAT level
At the ZAT level, two files (folders) are included, along with the codebook.
ZAT.zip: boundary files – a table of 1141 rows, 6 cols (each row being a ZAT unit);
Intersections/ZAT_INDICADORES.xlsx: a table of 919 rows, 18 cols (each row being a ZAT unit).
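The two tables differ in row count (1141 vs. 919), so it can help to check which ZAT units appear in one but not the other. Below is a minimal sketch of such a check. It uses small made-up tables (`zat_geo_toy` and `zat_ind_toy` are hypothetical names, not the real data) and assumes both tables share a `ZAT` key column.

```r
library(dplyr)

# Toy stand-ins for the boundary table and the indicator table.
# All values are made up for illustration.
zat_geo_toy <- tibble(ZAT = c(1, 2, 3, 4, 5))
zat_ind_toy <- tibble(ZAT = c(2, 3, 5), NUMINT = c(10, 0, 7))

# Boundary units that have no row in the indicator table.
missing_in_ind <- anti_join(zat_geo_toy, zat_ind_toy, by = "ZAT")

# Indicator rows that have no matching boundary unit.
missing_in_geo <- anti_join(zat_ind_toy, zat_geo_toy, by = "ZAT")

nrow(missing_in_ind)
nrow(missing_in_geo)
```

With the real data, the same two `anti_join()` calls could be applied to the shapefile attributes and the indicator table, provided the `ZAT` columns have compatible types.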
ZAT folder
To begin with, I looked at the ZAT folder, which contains shapefiles that define boundaries of each ZAT zone.
Reading layer `ZAT' from data source
`D:\LocalGitHub\BEPIDL\data\ZAT\ZAT_geo\ZAT.shp' using driver `ESRI Shapefile'
Simple feature collection with 1141 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY, XYZ
Bounding box: xmin: -74.88439 ymin: 3.70344 xmax: -73.05211 ymax: 5.83076
z_range: zmin: 0 zmax: 0
Geodetic CRS: MAGNA-SIRGAS
zat
Simple feature collection with 1141 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY, XYZ
Bounding box: xmin: -74.88439 ymin: 3.70344 xmax: -73.05211 ymax: 5.83076
z_range: zmin: 0 zmax: 0
Geodetic CRS: MAGNA-SIRGAS
First 10 features:
         Area MUNCod NOMMun ZAT UTAM                       geometry
 1 1366816039      0   <NA> 810  N/A MULTIPOLYGON Z (((-74.08207...
 2  824419773      0   <NA> 809  N/A MULTIPOLYGON Z (((-74.05947...
 3 1823077343      0   <NA> 823  N/A MULTIPOLYGON Z (((-73.64757...
 4  929997085      0   <NA> 822  N/A MULTIPOLYGON Z (((-73.54534...
 5 1409253432      0   <NA> 821  N/A MULTIPOLYGON Z (((-73.59517...
 6 1085719093  11001 Bogotá 796 UPR3 MULTIPOLYGON Z (((-74.08374...
 7 1820638886      0   <NA> 800  N/A MULTIPOLYGON Z (((-74.30147...
 8 1503956580      0   <NA> 811  N/A MULTIPOLYGON Z (((-74.49689...
 9 2349669218      0   <NA> 820  N/A MULTIPOLYGON Z (((-73.99614...
10 1322014504      0   <NA> 812  N/A MULTIPOLYGON Z (((-74.4526 ...
# Drop the Z dimension, then draw the ZAT polygons on an interactive map
zat %>%
  st_zm() %>%
  leaflet() %>%
  addTiles() %>%
  leaflet::addPolygons()
ZAT_INDICADORES
Next, we look at the ZAT_INDICADORES.xlsx table. Comparing against the codebook, I noticed that the indicators in this table seem to be the “old” version. For all the ZAT units included there are no missing data, although some of the values are zero. Below we show the distribution of all the indicators in the table, tentatively grouped into three categories: public transit, road features, and traffic lights.
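The completeness statement above can be checked with a short per-indicator summary. The following is a sketch only: `zat_toy` is a hypothetical stand-in with made-up values, and it borrows two indicator names from the table purely as column labels.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the indicator table (values are made up).
zat_toy <- tibble(
  ZAT        = 1:4,
  NUMINT     = c(12, 0, 5, 0),
  NUMBRIDGES = c(0, 0, 0, 1)
)

# One row per indicator: number of missing values and share of zeros.
completeness <- zat_toy %>%
  pivot_longer(-ZAT, names_to = "indicator") %>%
  group_by(indicator) %>%
  summarise(
    n_missing  = sum(is.na(value)),
    share_zero = mean(value == 0, na.rm = TRUE),
    .groups    = "drop"
  )

completeness
```

Reporting the share of zeros next to the missing count makes it easier to spot indicators that are complete but mostly zero.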
Public transit
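library(readxl)     # read_xlsx()
library(tidyverse)  # dplyr, tidyr, ggplot2

zat_data <- read_xlsx("../../../../data/ZAT/ZAT_INDICADORES.xlsx")

# Drop columns 2 and 3, then reshape to one row per ZAT and indicator
zat_data_long <- zat_data %>%
  select(-2, -3) %>%
  pivot_longer(-ZAT, names_to = "indicator")

# Tentative indicator groups
bus    <- c('BUSTOPDENS', 'NUMRBP', 'LONGRBP', 'NUMRT', 'LONGRT')
onroad <- c('LONGMV', 'LRDENS', 'BPRDRATE', 'NUMINT', 'INTDENS')
tlight <- c('NUMTTFLIGH', 'NUMPTFLIGH')
byroad <- c('NUMTTREES', 'NUMSTTREES', 'NUMBRIDGES')

# all_of() is a tidyselect helper for selecting functions;
# a plain character vector is enough on the right-hand side of %in%
zat_data_long %>%
  filter(indicator %in% bus) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = "skyblue", color = "blue") +
  facet_wrap(~indicator, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Indicators related to Bus/BRT")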
Road features
# Indicators describing features on the road
zat_data_long %>%
  filter(indicator %in% onroad) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = "skyblue", color = "blue") +
  facet_wrap(~indicator, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Indicators related to Features on the Road")
# Indicators describing features by the road (trees, bridges)
zat_data_long %>%
  filter(indicator %in% byroad) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = "skyblue", color = "blue", alpha = 0.7) +
  facet_wrap(~indicator, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Indicators related to Features by the Road")
Traffic lights
# Traffic light indicators
zat_data_long %>%
  filter(indicator %in% tlight) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = "skyblue", color = "blue", alpha = 0.7) +
  facet_wrap(~indicator, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Indicators about Traffic Lights")
Street level
So far, the data folder included is:
Calles_datos/
This is a geo-referenced data table containing qualitative and quantitative data at the street level. Because the table is large, we take only the first 100 rows to inspect the geometry.
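A minimal sketch of this kind of subsetting is shown below. It does not read the real street file: `calles_toy` is a hypothetical `sf` object built from made-up line segments, and the CRS (EPSG:4326) is chosen only for the toy example.

```r
library(sf)

# Toy stand-in for the street table: 150 short line segments
# with made-up coordinates.
toy_lines <- lapply(1:150, function(i) {
  st_linestring(rbind(
    c(-74.1 + i * 0.001, 4.600),
    c(-74.1 + i * 0.001, 4.601)
  ))
})
calles_toy <- st_sf(id = 1:150, geometry = st_sfc(toy_lines, crs = 4326))

# Keep only the first 100 features; the result is still an sf object.
calles_head <- head(calles_toy, 100)

nrow(calles_head)
```

The subset keeps its geometry column, so it can be passed to a mapping pipeline in the same way as the ZAT polygons.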
Apart from the geometry information, the table includes many variables whose names do not seem to appear in the codebook. Skimming the full table:
The initial skimming report above shows the types of all 67 columns, plus the mean/sd and distribution of the numeric columns. While the completion rate of the columns seems high, some columns contain many zeros, with a 75th percentile equal to 0.
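One way to list the zero-heavy columns explicitly is to compute the 75th percentile of every numeric column and keep those where it equals zero. The sketch below uses a hypothetical toy table (`calles_toy_df`, made-up values), not the street data itself.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the street-level attributes (values are made up).
calles_toy_df <- tibble(
  mostly_zero = c(0, 0, 0, 0, 0, 0, 0, 9),
  spread_out  = c(1, 2, 3, 4, 5, 6, 7, 8),
  label       = rep(c("x", "y"), 4)
)

# 75th percentile of each numeric column, in long form,
# keeping only the columns where it is exactly zero.
zero_heavy <- calles_toy_df %>%
  summarise(across(
    where(is.numeric),
    ~ quantile(.x, 0.75, na.rm = TRUE, names = FALSE)
  )) %>%
  pivot_longer(everything(), names_to = "column", values_to = "p75") %>%
  filter(p75 == 0)

zero_heavy$column
```

A zero 75th percentile means at least three quarters of the non-missing values in that column are zero (or smaller), which is the pattern noted in the skimming report.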