Extending R's GTFS abilities with simple features

June 2, 2018 - 10 minutes
Two functions to easily transition from library(gtfsr) into library(sf) for tidyier transit analysis.
Civic Data sf gtfsr tidyverse

TLDR

Two similar functions for building sf objects of the routes and stops within a gtfs objects.

General Transit Feed Specification

Originally named Google Transit Feed Specification, GTFS was developed in 2005 by Chris Harrelson (a Googler at the time), to make trip planning on civic transit systems easier.

GTFS format is a standardized set of comma separated txt files, with specific names and fields. To be flexible (i.e. cover a lot of public transit options), not all files/fields are required. Here is a comprehensive list breakdown.

library(gtfsr)

The rOpenSci project’s library(gtfsr) is designed to make reading and visualizing GTFS data fast and easy. You can install the package from GitHub, via devtools::install_github("ropensci/gtfsr").

gtfsr::import_gtfs()

This function does a great job handling the flexible GTFS structure. Here I’m reading in the local Charlottesville Area Transit’s (CAT) GTFS feed.

The zipped data is available in this Civic Innovation Day repo.

library(gtfsr)

cat <- gtfsr::import_gtfs("https://github.com/Smart-Cville/CID-2018-Regional-Transit-Challenge/blob/b6c714ec190f8843d6aa154fc74ed7be3bd5307f/data/2017_08_CharlottesvilleAreaTransit.zip?raw=true")
## [1] "agency.txt"         "calendar_dates.txt" "calendar.txt"      
## [4] "routes.txt"         "shapes.txt"         "stop_times.txt"    
## [7] "stops.txt"          "trips.txt"
# lots of verbose warnings that I turned off for kniting porpoises

The function returns to us a special S3 class list object, gtfs, that has a data frame, for each of the available files.

class(cat)
## [1] "gtfs"
names(cat)
## [1] "agency_df"         "calendar_dates_df" "calendar_df"      
## [4] "routes_df"         "shapes_df"         "stop_times_df"    
## [7] "stops_df"          "trips_df"

gtfsr::map_gtfs()

A high-level mapping function built with library(leaflet).

In this specific case, attempting to plot all routes results in an error about missing the agency_id field. So my work around here is to explicitly specify all routes, which is easy via %>% dot.

library(magrittr)

cat %>% map_gtfs() # errors out
## Error in slice_impl(.data, dots): Evaluation error: object 'agency_id' not found.
# work around
cat %>% map_gtfs(., route_ids = .$routes_df$route_id,
                 route_colors = paste0("#", .$routes_df$route_color)) %>%
    # needed for web sharing only
    widgetframe::frameWidget() # isolates leaflet:css, to prevent messing with webpage:css

You can compare the library(gtfsr) results against the current CAT app and static map.

Extending to sf

The gtfsr package does a great job reading and showing data, but if you want to join in additional data sources or customize plots beyond colors or routes, you are out of luck.

Simple Features is making big waves in the r-spatial community since library(sf)(https://github.com/r-spatial/sf) introduced a new class of geometry enriched data frames.

Extracting routes into routes_sf

Because GTFS data is distributed across multiple files, we must pick apart the list of data frames to pull out the critical pieces.

In order to reconstruct a plot similar to map_gtfs(), we need:

  1. key-value pairs from cat$trips_df
  2. route meta information from cat$routes_df
  3. shapes as Lon/lat coordinates from cat$shapes_df
  • Caution: the following function is long *
library(sf)
library(tidyverse)

gtfs_routes_sf <- function(gtfs) {
    
    ## gather key-values first ----
    
    # trips_df has route_id:shape_id
    shape_key <- gtfs$trips_df %>%
        select(route_id, shape_id) %>%
        unique()

    # routes_df has route_id:route_name
    route_key <- gtfs$routes_df %>%
        select(route_id, route_short_name) %>%
        mutate(route_short_name = paste("route", route_short_name)) %>%
        inner_join(shape_key)
    
    # check for colors :)
    if ( !is.null(gtfs$routes_df$route_color) ) { # extract if they exist
        route_key %<>% inner_join(select(gtfs$routes_df, route_color, route_id) )
    }
    else { # planB: build a pal from my favorite pallette 'd3'
        route_key %<>% mutate(route_color = rep(ggsci::pal_d3()(10),
                                                length.out = nrow(route_key)))
    }
    
    ## build the sf object ----
    
    # exctract lon/lat values as matrix to build linestrings for each "shape_id"
    sfc <- gtfs$shapes_df %>% # long data frame
        split(.$shape_id) %>% # list of shorted data framee, one per shape
        map(~ select(., shape_pt_lon, shape_pt_lat) %>% # order maters, lon-1st lat-2nd
            as.matrix %>% # coherce for st_linestrings happiness
            st_linestring) %>%
        st_sfc(crs = 4326) # bundle all shapes into a collection

    # add collection on and convert to sf
    unique(gtfs$shapes_df$shape_id) %>%
        sort() %>% # sort to match with names(sfc); split()'s factor-cohercion alpha sorts
        st_sf(shape_id = ., geometry = sfc) %>%
        inner_join(route_key)
    # plots fine ^^
    
    # st_sf(route_key, geometry = sfc)
    # doesn't plot ^^ and I can't explain it
}

OK that a lot of code, but does it actually work?

cat_routes_sf <- gtfs_routes_sf(cat)

head(cat_routes_sf)
## Simple feature collection with 6 features and 4 fields
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: -78.52748 ymin: 37.99908 xmax: -78.45434 ymax: 38.03335
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   shape_id route_id route_short_name route_color
## 1    80859    15288          route 1      f47721
## 2    80860    15288          route 1      f47721
## 3    81122    15288          route 1      f47721
## 4    81149    15289          route 2      9c4e0f
## 5    81150    15289          route 2      9c4e0f
## 6    81151    15290          route 3      2b328a
##                         geometry
## 1 LINESTRING (-78.45512 38.02...
## 2 LINESTRING (-78.48065 38.00...
## 3 LINESTRING (-78.48065 38.00...
## 4 LINESTRING (-78.4869 38.013...
## 5 LINESTRING (-78.4869 38.013...
## 6 LINESTRING (-78.52402 37.99...

Yea it does.

Now all we need to do is viz it and since library(leaflet) already works out of the box with sf objects, it’s cake.

library(leaflet)

leaflet(cat_routes_sf) %>%
    addTiles() %>%
    addPolylines(color = ~paste0("#", route_color),
                 label = ~as.character(route_short_name)) %>%
    addLegend(colors = ~unique(paste0("#", route_color)),
              labels = ~unique(route_short_name)) %>%
    widgetframe::frameWidget()

Rinse and repeat for stops_sf()

The steps are slightly different but the pattern is the same (and it’s still long).

gtfs_stops_sf <- function(gtfs) {
    shape_key <- gtfs$trips_df %>%
        select(trip_id, route_id, shape_id) %>%
        unique()
    
    # stop_times_df also has stop sequence and arrive/depart time for specific stops
    stop_key <- gtfs$stop_times_df %>%
        select(trip_id, stop_id) %>%
        unique() %>%
        inner_join(shape_key) %>% # one stop is on multiple routes
        # need to pair down
        arrange(route_id) %>% # use route_id as tiebreaker (for now)
        group_by(stop_id) %>% # group_by() to stop_id 
        slice(1) # to slice() out the first row
    
    if ( !is.null(gtfs$routes_df$route_color) ) {
        stop_key %<>% inner_join(select(gtfs$routes_df, route_color, route_id)) }
    else {stop_key %<>% mutate(route_color = rep(ggsci::pal_d3()(10), length.out = nrow(route_key))) }
    
    stops_sfc <- gtfs$stops_df %>%
        split(.$stop_id) %>%
        map(~select(., stop_lon, stop_lat) %>%
                unlist() %>%
                st_point() ) %>% # point instead of linestring
        st_sfc()
    
    st_sf(stop_key, geometry = stops_sfc) %>%
        inner_join(gtfs$stops_df)
}

Now, make use and make stops_sf object.

cat_stops_sf <- gtfs_stops_sf(cat)

head(cat_stops_sf)
## Simple feature collection with 6 features and 15 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -78.51638 ymin: 38.02906 xmax: -78.45877 ymax: 38.04909
## epsg (SRID):    NA
## proj4string:    NA
##   trip_id stop_id route_id shape_id route_color stop_code
## 1 2651606 1040662    15296    84777      ed228e     20096
## 2 2651941 1041094    15297    84840      3dc4e5     20106
## 3 2562851  701531    15294    82000      752b8f     10472
## 4 2562851  701532    15294    82000      752b8f     10660
## 5 2787405  701533    15300    88423      086d38     10687
## 6 2546273  701534    15288    80859      f47721     11015
##                                stop_name stop_desc stop_lat  stop_lon
## 1                                   YMCA      <NA> 38.04909 -78.47792
## 2      The Pantops Shopping Ctr at Roses      <NA> 38.03150 -78.45877
## 3 Emmet St at the Central Grounds Garage      <NA> 38.03628 -78.50786
## 4   Emmet St at the Snyder Tennis Courts      <NA> 38.03833 -78.50706
## 5            Alderman Road at Stadium Rd      <NA> 38.02906 -78.51638
## 6                E Water St at 3rd St SE      <NA> 38.02956 -78.48014
##                     stop_url location_type parent_station stop_timezone
## 1 http://www.CatchTheCAT.org             0           <NA>          <NA>
## 2 http://www.CatchTheCAT.org             0           <NA>          <NA>
## 3 http://www.CatchTheCAT.org             0           <NA>          <NA>
## 4 http://www.CatchTheCAT.org             0           <NA>          <NA>
## 5 http://www.CatchTheCAT.org             0           <NA>          <NA>
## 6 http://www.CatchTheCAT.org             0           <NA>          <NA>
##   wheelchair_boarding                   geometry
## 1                  NA POINT (-78.47792 38.04909)
## 2                  NA  POINT (-78.45877 38.0315)
## 3                  NA POINT (-78.50786 38.03628)
## 4                  NA POINT (-78.50706 38.03833)
## 5                  NA POINT (-78.51638 38.02906)
## 6                  NA POINT (-78.48014 38.02956)

Now we just have to combine our two layers, routes and stops.

To make a map that looks a lot like to the result from map_gtfs(), I’m gonna tweak a few of the defaults to addCircleMarkers().

leaflet(cat_routes_sf) %>%
    addTiles() %>%
    addPolylines(color = ~paste0("#", route_color),
                 label = ~htmltools::htmlEscape(route_short_name)) %>%
    addCircleMarkers(data = cat_stops_sf, fillColor = ~paste0("#", route_color),
                     label = ~htmltools::htmlEscape(stop_name),
                     color = "black", radius = 5, weight = 3) %>%
    addLegend(colors = ~unique(paste0("#", route_color)),
              labels = ~unique(route_short_name)) %>%
    widgetframe::frameWidget()

👌

Wrap up

Now we can build sf objects from gtfs objects at-will!

That gives us new plotting flexibility and makes incorporating new data more straight forward.

Looks like exploring transit patterns in R just got a little easier! Now if only all transit data was already cleaned into GTFS …

devtools::session_info()
##  setting  value                       
##  version  R version 3.5.0 (2018-04-23)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-11-15                  
## 
##  package     * version    date       source                            
##  assertthat    0.2.0      2017-04-11 CRAN (R 3.5.0)                    
##  backports     1.1.2      2017-12-13 CRAN (R 3.5.0)                    
##  base        * 3.5.0      2018-04-24 local                             
##  bindr         0.1.1      2018-03-13 CRAN (R 3.5.0)                    
##  bindrcpp    * 0.2.2      2018-03-29 CRAN (R 3.5.0)                    
##  blogdown      0.6        2018-04-18 CRAN (R 3.5.0)                    
##  bookdown      0.7        2018-02-18 CRAN (R 3.5.0)                    
##  broom         0.5.0      2018-07-17 CRAN (R 3.5.0)                    
##  cellranger    1.1.0      2016-07-27 CRAN (R 3.5.0)                    
##  class         7.3-14     2015-08-30 CRAN (R 3.5.0)                    
##  classInt      0.2-3      2018-04-16 CRAN (R 3.5.0)                    
##  cli           1.0.1      2018-09-25 cran (@1.0.1)                     
##  colorspace    1.3-2      2016-12-14 CRAN (R 3.5.0)                    
##  compiler      3.5.0      2018-04-24 local                             
##  crayon        1.3.4      2018-11-02 Github (r-lib/crayon@0398b12)     
##  crosstalk     1.0.0      2016-12-21 CRAN (R 3.5.0)                    
##  curl          3.2        2018-03-28 CRAN (R 3.5.0)                    
##  datasets    * 3.5.0      2018-04-24 local                             
##  DBI           1.0.0      2018-05-02 CRAN (R 3.5.0)                    
##  devtools      1.13.6     2018-06-27 CRAN (R 3.5.0)                    
##  digest        0.6.18     2018-10-10 cran (@0.6.18)                    
##  dplyr       * 0.7.8      2018-11-10 CRAN (R 3.5.0)                    
##  e1071         1.6-8      2017-02-02 CRAN (R 3.5.0)                    
##  evaluate      0.12       2018-10-09 cran (@0.12)                      
##  forcats     * 0.3.0      2018-02-19 CRAN (R 3.5.0)                    
##  ggplot2     * 3.1.0.9000 2018-11-05 Github (tidyverse/ggplot2@15ddc22)
##  glue          1.3.0      2018-07-17 cran (@1.3.0)                     
##  graphics    * 3.5.0      2018-04-24 local                             
##  grDevices   * 3.5.0      2018-04-24 local                             
##  grid          3.5.0      2018-04-24 local                             
##  gtable        0.2.0      2016-02-26 CRAN (R 3.5.0)                    
##  gtfsr       * 1.0.3      2018-05-26 Github (ropensci/gtfsr@3af920a)   
##  haven         1.1.2      2018-06-27 cran (@1.1.2)                     
##  hms           0.4.2      2018-03-10 CRAN (R 3.5.0)                    
##  htmltools     0.3.6      2017-04-28 CRAN (R 3.5.0)                    
##  htmlwidgets   1.2        2018-04-19 CRAN (R 3.5.0)                    
##  httpuv        1.4.5.9000 2018-10-24 Github (rstudio/httpuv@2cbf7bd)   
##  httr          1.3.1      2017-08-20 CRAN (R 3.5.0)                    
##  jsonlite      1.5        2017-06-01 CRAN (R 3.5.0)                    
##  knitr         1.20       2018-02-20 CRAN (R 3.5.0)                    
##  later         0.7.5      2018-09-18 cran (@0.7.5)                     
##  lattice       0.20-35    2017-03-25 CRAN (R 3.5.0)                    
##  lazyeval      0.2.1      2017-10-29 CRAN (R 3.5.0)                    
##  leaflet     * 2.0.2      2018-08-27 CRAN (R 3.5.0)                    
##  lubridate     1.7.4      2018-04-11 CRAN (R 3.5.0)                    
##  magrittr    * 1.5        2014-11-22 CRAN (R 3.5.0)                    
##  memoise       1.1.0      2017-04-21 CRAN (R 3.5.0)                    
##  methods     * 3.5.0      2018-04-24 local                             
##  mime          0.6        2018-10-05 cran (@0.6)                       
##  modelr        0.1.1      2017-07-24 CRAN (R 3.5.0)                    
##  munsell       0.5.0      2018-06-12 cran (@0.5.0)                     
##  nlme          3.1-137    2018-04-07 CRAN (R 3.5.0)                    
##  pillar        1.3.0.9000 2018-08-18 Github (r-lib/pillar@3fabb4e)     
##  pkgconfig     2.0.2      2018-08-16 cran (@2.0.2)                     
##  plyr          1.8.4      2016-06-08 CRAN (R 3.5.0)                    
##  promises      1.0.1      2018-04-13 CRAN (R 3.5.0)                    
##  purrr       * 0.2.5      2018-05-29 cran (@0.2.5)                     
##  R6            2.3.0      2018-10-04 cran (@2.3.0)                     
##  Rcpp          1.0.0      2018-11-07 cran (@1.0.0)                     
##  readr       * 1.1.1      2017-05-16 CRAN (R 3.5.0)                    
##  readxl        1.1.0      2018-04-20 CRAN (R 3.5.0)                    
##  rgdal         1.2-18     2018-03-17 CRAN (R 3.5.0)                    
##  rlang         0.3.0.1    2018-10-25 cran (@0.3.0.1)                   
##  rmarkdown     1.10       2018-06-11 CRAN (R 3.5.0)                    
##  rprojroot     1.3-2      2018-01-03 CRAN (R 3.5.0)                    
##  rstudioapi    0.8        2018-10-02 cran (@0.8)                       
##  rvest         0.3.2      2016-06-17 CRAN (R 3.5.0)                    
##  scales        1.0.0      2018-08-09 cran (@1.0.0)                     
##  sf          * 0.6-3      2018-05-17 CRAN (R 3.5.0)                    
##  shiny         1.1.0.9001 2018-09-27 Github (rstudio/shiny@3cea5fb)    
##  sp            1.3-1      2018-06-05 cran (@1.3-1)                     
##  spData        0.2.8.3    2018-03-25 CRAN (R 3.5.0)                    
##  stats       * 3.5.0      2018-04-24 local                             
##  stringi       1.2.4      2018-07-20 cran (@1.2.4)                     
##  stringr     * 1.3.1      2018-05-10 CRAN (R 3.5.0)                    
##  tibble      * 1.4.2      2018-01-22 CRAN (R 3.5.0)                    
##  tidyr       * 0.8.2      2018-10-28 cran (@0.8.2)                     
##  tidyselect    0.2.5      2018-10-11 cran (@0.2.5)                     
##  tidyverse   * 1.2.1      2017-11-14 CRAN (R 3.5.0)                    
##  tools         3.5.0      2018-04-24 local                             
##  udunits2      0.13       2016-11-17 CRAN (R 3.5.0)                    
##  units         0.5-1      2018-01-08 CRAN (R 3.5.0)                    
##  utils       * 3.5.0      2018-04-24 local                             
##  widgetframe   0.3.1      2017-12-20 CRAN (R 3.5.0)                    
##  withr         2.1.2.9000 2018-10-18 Github (jimhester/withr@be57595)  
##  xfun          0.2        2018-06-16 cran (@0.2)                       
##  xml2          1.2.0      2018-01-24 CRAN (R 3.5.0)                    
##  xtable        1.8-3      2018-08-29 cran (@1.8-3)                     
##  yaml          2.1.19     2018-05-01 cran (@2.1.19)

Geocoded crime reports for Charlottesville Virginia

November 27, 2018 - 5 minutes
Civic Data packages sf tidyverse

Quantifying time in meetings

September 2, 2018 - 5 minutes
Using Google Apps Script and R to analyze Google Calendars.
Business Intelligence googlesheets tidyverse

Maps with the new ggplot2 v3.0.0

August 4, 2018 - 2 minutes
Civic Data ggplot2 tidyverse