Geocoded crime reports for Charlottesville, Virginia

November 27, 2018 - 5 minutes
Civic Data packages sf tidyverse

cpdcrimedata is an R data package with a geocoded version of the Charlottesville Police Department’s public Assistant Reports for the last five years.

To install the package from GitHub:

devtools::install_github("nathancday/cpdcrimedata")

library(cpdcrimedata)

library(tidyverse) # for manipulation tools

The primary dataset is cpd_crime. It holds the original report’s 9 columns (UpperCamel), plus 4 new ones (lower_snake) related to geocoding:

  • formatted_address - address used in the successful Google API query
  • lat - latitude value returned
  • lon - longitude value returned
  • loc_type - type of location returned
data(cpd_crime)

names(cpd_crime)
##  [1] "RecordID"          "Offense"           "IncidentID"       
##  [4] "BlockNumber"       "StreetName"        "Agency"           
##  [7] "DateReported"      "HourReported"      "address"          
## [10] "lat"               "lon"               "formatted_address"
## [13] "loc_type"
map(cpd_crime, ~ table(.) %>% sort(decreasing = TRUE) %>% head)
## $RecordID
## .
## 1 2 3 4 5 6 
## 1 1 1 1 1 1 
## 
## $Offense
## .
##                   Towed Vehicle                  Assault Simple 
##                            2627                            2605 
##                     Hit and Run                       Vandalism 
##                            2221                            1929 
##             Larceny - All Other Assist Citizen - Mental/TDO/ECO 
##                            1801                            1625 
## 
## $IncidentID
## .
## 201000073238 201300002276 201300002481 201300003848 201300004323 
##            1            1            1            1            1 
## 201300004460 
##            1 
## 
## $BlockNumber
## .
##  100  200  600  700  500  400 
## 3956 2676 2296 1895 1846 1803 
## 
## $StreetName
## .
##        E MARKET ST          W MAIN ST         EMMET ST N 
##               1841               1342               1143 
##          E MAIN ST JEFFERSON PARK AVE        PRESTON AVE 
##                685                672                557 
## 
## $Agency
##   CPD 
## 30352 
## 
## $DateReported
## .
## 2016-01-22 2016-09-24 2017-10-16 2015-08-31 2015-10-31 2016-10-14 
##         48         38         38         37         37         37 
## 
## $HourReported
## .
## 1500 1600 1400 1700 1100 0500 
##  264  263  219  204  176  173 
## 
## $address
## .
##  600 E MARKET ST Charlottesville VA 700 PROSPECT AVE Charlottesville VA 
##                                1148                                 489 
##   1100 5TH ST SW Charlottesville VA     800 HARDY DR Charlottesville VA 
##                                 351                                 328 
##   400 GARRETT ST Charlottesville VA  1100 EMMET ST N Charlottesville VA 
##                                 319                                 314 
## 
## $lat
## .
## 38.0304127 38.0245896 38.0513687   38.01713 38.0334203 38.0279731 
##       1167        489        395        352        328        319 
## 
## $lon
## .
## -78.4774586 -78.4946679 -78.5000734  -78.497806 -78.4902161 -78.4803241 
##        1167         489         395         352         328         319 
## 
## $formatted_address
## .
##  600 E Market St, Charlottesville, VA 22902, USA 
##                                             1167 
## 700 Prospect Ave, Charlottesville, VA 22903, USA 
##                                              489 
##  1100 Emmet St N, Charlottesville, VA 22903, USA 
##                                              395 
##   1100 5th St SW, Charlottesville, VA 22902, USA 
##                                              352 
##     800 Hardy Dr, Charlottesville, VA 22903, USA 
##                                              328 
##   400 Garrett St, Charlottesville, VA 22902, USA 
##                                              319 
## 
## $loc_type
## .
## RANGE_INTERPOLATED            ROOFTOP   GEOMETRIC_CENTER 
##              16495              13436                369 
##        APPROXIMATE 
##                  3

The original data is left untouched.

It has all of the original warts and wrinkles, so you will likely need to do a little extra data cleaning. The Offense column, for example, has many variants of similar labels.

cpd_crime$Offense %>%
  keep(~ grepl("larceny", ., ignore.case = TRUE)) %>%
  table()
## .
##             Larceny - All Other Larceny - From Coin Oper Device 
##                            1801                               8 
##    Larceny - From Motor Vehicle   Larceny - Of Veh Parts/Access 
##                            1169                             265 
##        Larceny - Pocket Picking       Larceny - Purse Snatching 
##                              34                              15 
##           Larceny - Shoplifitng   Larceny - Theft from Building 
##                             692                             815
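One way to tame these variants is to collapse them into a single label before counting. This is only a minimal sketch in base R: the `offense` vector is a toy stand-in for cpd_crime$Offense (labels copied from the output above), and `offense_clean` is a name I made up for illustration.

```r
# Toy vector standing in for cpd_crime$Offense
offense <- c("Larceny - All Other", "Larceny - From Motor Vehicle",
             "Larceny - Pocket Picking", "Vandalism", "Hit and Run")

# Collapse every "Larceny - ..." variant into one label; leave the rest alone
offense_clean <- ifelse(grepl("^larceny", offense, ignore.case = TRUE),
                        "Larceny", offense)

table(offense_clean)
```

Whether all larceny types really belong in one bucket depends on the question you are asking, so treat the pattern as a starting point.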

Making a plot

Let’s look at the 6 most frequent offense labels we saw up above, with ggplot2.

library(tidyverse)

topn <- cpd_crime %>%
  mutate(Offense = fct_infreq(Offense)) %>%
  filter(Offense %in% levels(Offense)[1:6])
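fct_infreq() reorders the factor levels by frequency, so the first six levels are the six most common labels. The same idea can be sketched in base R. The vector below is toy data, not the real column, and `n`, `top_labels` and `offense_top` are illustrative names.

```r
# Toy labels; in the post this would be cpd_crime$Offense
offense <- c("Towed Vehicle", "Vandalism", "Towed Vehicle", "Hit and Run",
             "Vandalism", "Towed Vehicle", "Fraud")

n <- 2

# Sort label counts from most to least frequent and keep the first n names
top_labels <- names(sort(table(offense), decreasing = TRUE))[seq_len(n)]

# Keep only the entries whose label is among the top n
offense_top <- offense[offense %in% top_labels]

table(offense_top)
```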

By design this dataset contains all of the records in the original, including records that could not be geocoded. Several addresses were geocoded as outside of the city limits, and some are very far away!

To see the spatial distribution of police reports in the city, these “bad” records need to go. Here I’m using US Census maps from the CODP as the geographic mask to keep only the locations in the city.

library(sf)

# get a census map of charlottesville
cville_census <- st_read("https://opendata.arcgis.com/datasets/63f965c73ddf46429befe1132f7f06e2_15.geojson") %>%
  select(Tract)
## Reading layer `OGRGeoJSON' from data source `https://opendata.arcgis.com/datasets/63f965c73ddf46429befe1132f7f06e2_15.geojson' using driver `GeoJSON'
## Simple feature collection with 12 features and 353 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -78.52364 ymin: 38.00959 xmax: -78.44631 ymax: 38.0706
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
# drop records without coordinates, then keep only points inside a census tract
topn <- topn %>% 
  filter(!is.na(lat), !is.na(lon)) %>%
  st_as_sf(coords = c("lon", "lat"), crs = st_crs(cville_census)) %>%
  st_join(cville_census, left = FALSE)
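Before running the spatial join, a quick bounding-box check can show how many records are obviously unusable. The sketch below uses base R only. The box values are copied from the st_read() output above, the first toy record reuses the most common coordinates from the earlier tables, and the other two records are made up. A bounding box is coarser than the tract polygons, so this is a sanity check and not a replacement for st_join().

```r
# Bounding box copied from the st_read() output
bbox <- c(xmin = -78.52364, ymin = 38.00959, xmax = -78.44631, ymax = 38.0706)

# Toy records: one downtown, one failed geocode, one far away (last two made up)
pts <- data.frame(
  id  = 1:3,
  lon = c(-78.4774586, NA, -77.0365),
  lat = c(38.0304127, NA, 38.8977)
)

# The is.na() guards come first so missing coordinates count as outside the box
in_box <- !is.na(pts$lon) & !is.na(pts$lat) &
  pts$lon >= bbox["xmin"] & pts$lon <= bbox["xmax"] &
  pts$lat >= bbox["ymin"] & pts$lat <= bbox["ymax"]

table(in_box)
```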

Now we can plot with ggplot2/sf. Since geom_sf() can be prohibitively slow with ~9000 data points, I’m using a work-around with stat_density_2d().

# add the coordinates back as plain lon/lat columns for ggplot()
topn <- st_coordinates(topn) %>% 
  as_tibble() %>%
  setNames(c("lon","lat")) %>%
  bind_cols(topn)

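# stat_density_2d() is a good alternative to geom_sf() for this many points
# after_stat() needs ggplot2 >= 3.3.0; older versions use stat(level)
ggplot(cville_census) +
  geom_sf() +
  stat_density_2d(data = topn, aes(lon, lat, fill = after_stat(level)),
                  alpha = .5, geom = "polygon") +
  # the fill is a 2D density estimate, not a raw count of reports
  scale_fill_viridis_c(option = "A", name = "report density") +
  coord_sf(datum = NA) +
  facet_wrap(~Offense) +
  theme_void()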
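If you want the legend to be a literal number of reports, binning is another option: stat_bin_2d() counts the points that fall in each grid cell. The sketch below runs on made-up points scattered across roughly the same bounding box as the census map, so it only demonstrates the mechanics. The `bins = 20` value is an arbitrary choice for illustration.

```r
library(ggplot2)

# Made-up points inside roughly the same extent as the census map
set.seed(1)
pts <- data.frame(
  lon = runif(200, -78.52, -78.45),
  lat = runif(200, 38.01, 38.07)
)

p <- ggplot(pts, aes(lon, lat)) +
  stat_bin_2d(bins = 20) +  # fill defaults to the count in each cell
  scale_fill_viridis_c(option = "A", name = "# reports") +
  coord_quickmap() +
  theme_void()

p
```

With the real data, the same layer could be added on top of geom_sf() and faceted by Offense, as in the density plot.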

Going forward

Having this dataset as an R package is making my life easier. It was a good learning experience for me to put this thing together, and I pushed myself to get it set up for CI with Travis! I’m looking forward to keeping this dataset

Interested in converting other Charlottesville data into R packages (possibly one big meta-package) to make civic data analysis with #rstats more accessible/shareable? If you have ideas for other local datasets that could benefit from a package tune-up, send me an email or open an issue.
