Geocoded crime reports for Charlottesville, Virginia

November 27, 2018 - 5 minutes
Civic Data packages sf tidyverse

cpdcrimedata

cpdcrimedata is an R data package with a geocoded version of the Charlottesville Police Department’s public Assistant Reports for the last five years.

To install the package from GitHub:

# remotes::install_github("nathancday/cpdcrimedata")
library(cpdcrimedata)

library(tidyverse) # for manipulation tools

The primary dataset is cpd_crime. It holds the original report’s 9 columns (UpperCamel) plus 4 new ones (lower_snake) related to geocoding:

  • formatted_address - address used in the successful Google API query
  • lat - latitude value returned
  • lon - longitude value returned
  • loc_type - type of location returned
data(cpd_crime)

names(cpd_crime)
##  [1] "RecordID"          "Offense"           "IncidentID"       
##  [4] "BlockNumber"       "StreetName"        "Agency"           
##  [7] "DateReported"      "HourReported"      "address"          
## [10] "lat"               "lon"               "formatted_address"
## [13] "loc_type"
map(cpd_crime, ~ table(.) %>% sort(decreasing = TRUE) %>% head())
## $RecordID
## .
## 1 2 3 4 5 6 
## 1 1 1 1 1 1 
## 
## $Offense
## .
##                  Assault Simple                     Hit and Run 
##                            2536                            2274 
##                   Towed Vehicle                       Vandalism 
##                            2171                            1806 
##             Larceny - All Other Assist Citizen - Mental/TDO/ECO 
##                            1791                            1706 
## 
## $IncidentID
## .
## 201000073238 201300004647 201300005270 201300007575 201400000108 201400000504 
##            1            1            1            1            1            1 
## 
## $BlockNumber
## .
##  100  200  600  700  500  800 
## 3938 2467 2224 1847 1765 1749 
## 
## $StreetName
## .
##        E MARKET ST          W MAIN ST         EMMET ST N          E MAIN ST 
##               1778               1286               1136                676 
## JEFFERSON PARK AVE       PROSPECT AVE 
##                594                553 
## 
## $Agency
##   CPD 
## 29475 
## 
## $DateReported
## .
## 2016-01-22 2015-10-31 2016-09-24 2016-10-14 2017-10-16 2015-08-31 
##         48         38         38         38         38         37 
## 
## $HourReported
## .
## 1600 1500 1400 1700 1100 1300 
##  237  231  172  172  151  136 
## 
## $address
## .
##  600 E MARKET ST Charlottesville VA 700 PROSPECT AVE Charlottesville VA 
##                                1120                                 510 
##   1100 5TH ST SW Charlottesville VA     800 HARDY DR Charlottesville VA 
##                                 362                                 347 
##   400 GARRETT ST Charlottesville VA  1100 EMMET ST N Charlottesville VA 
##                                 310                                 309 
## 
## $lat
## .
## 38.0304127 38.0245896 38.0513687   38.01713 38.0334203 38.0279731 
##       1153        510        379        363        347        310 
## 
## $lon
## .
## -78.4774586 -78.4946679 -78.5000734  -78.497806 -78.4902161 -78.4803241 
##        1153         510         379         363         347         310 
## 
## $formatted_address
## .
##  600 E Market St, Charlottesville, VA 22902, USA 
##                                             1156 
## 700 Prospect Ave, Charlottesville, VA 22903, USA 
##                                              510 
##  1100 Emmet St N, Charlottesville, VA 22903, USA 
##                                              379 
##   1100 5th St SW, Charlottesville, VA 22902, USA 
##                                              363 
##     800 Hardy Dr, Charlottesville, VA 22903, USA 
##                                              347 
##   400 Garrett St, Charlottesville, VA 22902, USA 
##                                              310 
## 
## $loc_type
## .
## RANGE_INTERPOLATED            ROOFTOP   GEOMETRIC_CENTER        APPROXIMATE 
##              16072              12897                442                 11

The original data is left untouched.

It has all of the original warts and wrinkles, so you will likely need to do a little extra data cleaning. For example, the Offense column has many variants of similar labels.

cpd_crime$Offense %>%
  keep(~ grepl("larceny", ., ignore.case = TRUE)) %>%
  table()
## .
##             Larceny - All Other Larceny - From Coin Oper Device 
##                            1791                               8 
##    Larceny - From Motor Vehicle   Larceny - Of Veh Parts/Access 
##                            1087                             253 
##        Larceny - Pocket Picking       Larceny - Purse Snatching 
##                              36                              15 
##           Larceny - Shoplifitng   Larceny - Theft from Building 
##                             684                             712
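
One possible way to tame these variants is to collapse each label to the text before its first " - " separator. The sketch below is only an illustration, not part of the package: it runs on a small hand-typed vector of labels copied from the output above, and `offense_group` is a hypothetical name.

```r
# Toy stand-in for cpd_crime$Offense, with labels copied from the tables above
offense <- c(
  "Larceny - All Other",
  "Larceny - From Motor Vehicle",
  "Larceny - Shoplifitng",
  "Assist Citizen - Mental/TDO/ECO",
  "Vandalism"
)

# Drop everything from the first " - " onward; labels without it are unchanged
offense_group <- sub(" - .*$", "", offense)

# Count reports per collapsed label, most frequent first
sort(table(offense_group), decreasing = TRUE)
```

This grouping is deliberately coarse. It also sidesteps misspelled sub-labels such as "Shoplifitng", since only the prefix is kept.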

Making a plot

Let’s look at the 6 most frequent offense labels we saw above, with ggplot2.

library(tidyverse)

topn <- cpd_crime %>%
  mutate(Offense = fct_infreq(Offense)) %>%
  filter(Offense %in% levels(Offense)[1:6])

By design, this dataset contains all of the records in the original, including records that could not be geocoded. Several addresses were geocoded to locations outside of the city limits, and some are very far away!

To see the spatial distribution of police reports in the city, these “bad” records need to go. Here I’m using US Census maps from the CODP as the geographic mask to keep only the locations in the city.

library(sf)

# get a census map of charlottesville
cville_census <- st_read("https://opendata.arcgis.com/datasets/63f965c73ddf46429befe1132f7f06e2_15.geojson") %>%
  select(Tract)
## Reading layer `7a85474e-f076-4514-8fb3-89d96a99ef8a202045-1-vqdn7z.lu9xc' from data source `https://opendata.arcgis.com/datasets/63f965c73ddf46429befe1132f7f06e2_15.geojson' using driver `GeoJSON'
## Simple feature collection with 12 features and 353 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -78.52364 ymin: 38.00959 xmax: -78.44631 ymax: 38.0706
## CRS:            4326
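topn <- topn %>% 
  # drop records that could not be geocoded
  filter(!is.na(lat), !is.na(lon)) %>%
  # convert to sf points in the same CRS as the census map
  st_as_sf(coords = c("lon", "lat"), crs = st_crs(cville_census)) %>%
  # inner spatial join: keep only points that fall inside a census tract
  st_join(cville_census, left = FALSE)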
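
To see what the mask does in isolation, here is a minimal sketch on made-up data. The square polygon `city` and the three `reports` rows are hypothetical stand-ins for the census map and the crime records. Only the mechanics (drop missing coordinates, convert to points, inner spatial join) mirror the step above.

```r
library(sf)

# Hypothetical 1 x 1 degree square standing in for the city boundary
square <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
city <- st_sf(
  Tract = "toy",
  geometry = st_sfc(st_polygon(list(square)), crs = 4326)
)

# Three made-up reports: one inside, one outside, one never geocoded
reports <- data.frame(
  id  = 1:3,
  lon = c(0.5, 2.0, NA),
  lat = c(0.5, 2.0, 0.5)
)

# Drop rows with missing coordinates, then convert to sf points
geocoded <- reports[!is.na(reports$lon) & !is.na(reports$lat), ]
points <- st_as_sf(geocoded, coords = c("lon", "lat"), crs = st_crs(city))

# left = FALSE makes this an inner join, so points outside the polygon are dropped
inside <- st_join(points, city, left = FALSE)
inside$id
```

Only the report with `id` 1 should survive, because it is the only one with coordinates that fall inside the square.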

Now we can plot with ggplot2/sf. Since geom_sf() can be prohibitively slow with ~9000 data points, I’m using a work-around with stat_density_2d().

# add the coordinates back as data frame columns for ggplot()
topn <- st_coordinates(topn) %>% 
  as_tibble() %>%
  setNames(c("lon","lat")) %>%
  bind_cols(topn)

# stat_bin_2d() is a good alternative to stat_density_2d()
ggplot(cville_census) +
  geom_sf() +
  stat_density_2d(data = topn, aes(lon, lat, fill = after_stat(level)),
                  alpha = .5, geom = "polygon") +
  scale_fill_viridis_c(option = "A", name = "# reports") +
  coord_sf(datum = NA) +
  facet_wrap(~Offense) +
  theme_void()
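
The density layer fills by the contour level of a kernel density estimate, not by a raw count of reports. If actual counts per cell are wanted, the stat_bin_2d() alternative can be sketched roughly as below. The `pts` data frame is simulated purely for illustration (it is not the crime data), and `bins = 20` is an arbitrary choice.

```r
library(ggplot2)

# Simulated points near downtown Charlottesville, for illustration only
set.seed(1)
pts <- data.frame(
  lon = rnorm(500, mean = -78.48, sd = 0.01),
  lat = rnorm(500, mean = 38.03, sd = 0.01)
)

# Bin points into a 20 x 20 grid and fill each cell by its point count
p <- ggplot(pts, aes(lon, lat)) +
  stat_bin_2d(aes(fill = after_stat(count)), bins = 20) +
  scale_fill_viridis_c(option = "A", name = "# reports") +
  coord_quickmap() +
  theme_void()

# Inspect the computed bins without drawing the plot
binned <- layer_data(p, 1)
head(binned$count)
```

With the real data, the same layer could be added on top of `geom_sf()` and faceted by Offense, so that the "# reports" legend reflects counts per cell.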

Going forward

Having this dataset as an R package is making my life easier. It was a good learning experience for me to put this thing together, and I pushed myself to get it set up for CI with Travis! I’m looking forward to keeping this dataset

Interested in converting other Charlottesville data into R packages (possibly one big meta-package) to make civic data analysis with #rstats more accessible/shareable? If you have ideas for other local datasets that could benefit from a package tune-up, send me an email or open an issue.
