Geocoding the CPD crime reports

December 9, 2017 - 9 minutes
Geocoding the Charlottesville Police Dept's crime reports with Google Maps
Cville Open Data data wrangling geocode ggmap

Geocoding

“Is the process of converting addresses (like a street address) into geographic coordinates (like latitude and longitude), which you can use to place markers on a map, or position the map.” - Google Maps API Documentation

Unfortunately the Crime Data from the Charlottesville Open Data Portal (ODP) doesn’t come curated with latitude and longitude coordinates. So it is up to us to geocode it by accessing the Google Maps API via the great R 📦 ggmap.

# Load Out -----------------------------------------------
library(geojsonio)
library(ggmap)
library(tidyverse) # always
library(magrittr) # %<>% life

Import

Using the ODP’s API is easy with library(geojsonio), and I made a separate API post here if you want more details. I like to use geojsonio::geojson_read to import any data from the ODP. By default the return value will be a list object in the “geoJSON” format. Because the crime data has no geometry/shape components, we can parse the original list down to just a flat table.

crime_json <- geojson_read("https://opendata.arcgis.com/datasets/d1877e350fad45d192d233d2b2600156_7.geojson",
                           parse = TRUE) # ~ 1-5 minutes...for a 2 element list

# parse out table of interest
crime_df <- crime_json[["features"]]
crime_df <-  crime_df[["properties"]]

We now have 31,885 crime reports in our hands.

This dataset contains all of the crimes reported in Charlottesville since the fall of 2012, except for homicides. I hope the homicide reports are included in the future, because homicides are the most serious threat to public safety and the public deserves to know.

Since geocoding is based exclusively on the address of a location, we need to pay special attention to the quality of the information in the BlockNumber and StreetName columns. Because this dataset is populated directly from reports, there will inevitably be some typos, but we can take a few simple steps to improve our geocoding success.

First let’s look at BlockNumber. This dataset does not include specific addresses by design, in an effort to balance privacy and public safety, so all values are rounded to the nearest hundred.

# first step always trim white space
crime_df %<>% mutate_at(vars(BlockNumber, StreetName), funs(trimws(.)))

# convert to numeric for sorting
crime_df$BlockNumber %<>% as.numeric()
table(crime_df$BlockNumber, useNA = "always") %>% sort()
## 
##  2800  3100  3400  3800  4400  6200  6300  6500  8000  8200 10000 10800 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 11000 11100 12000 12200 14500 15000 15600 28000  2900  3300  5300  7200 
##     1     1     1     1     1     1     1     1     2     2     2     2 
##  9100  3000  <NA>  2600  4000  2700  2400  2500  2300  2200  2000  1700 
##     2     4    19    22    26    42   122   123   228   276   357   418 
##  2100     0  1800  1900  1300  1200  1600  1500  1000  1400  1100   900 
##   435   498   618   632   906   950   955  1047  1252  1460  1570  1685 
##   300   800   500   700   400   600   200   100 
##  1713  1722  1937  1951  1967  2265  2962  3693

From my personal experience, I know that addresses for the “0” block, like “0 Avon St”, geocode poorly. To combat this I am recoding all reports from a zero hundred block to a one hundred block. This is a small shift geographically, and since it often returns a more accurate location we are likely improving subsequent spatial analysis, because the zero hundred block usually geocodes to somewhere near the middle of the street’s length. I am also recoding the NA values to the one hundred block.

crime_df$BlockNumber %<>% ifelse(is.na(.), 100, .) %>% ifelse(. == 0, 100, .)
table(crime_df$BlockNumber, useNA = "always") # better
## 
##   100   200   300   400   500   600   700   800   900  1000  1100  1200 
##  4210  2962  1713  1967  1937  2265  1951  1722  1685  1252  1570   950 
##  1300  1400  1500  1600  1700  1800  1900  2000  2100  2200  2300  2400 
##   906  1460  1047   955   418   618   632   357   435   276   228   122 
##  2500  2600  2700  2800  2900  3000  3100  3300  3400  3800  4000  4400 
##   123    22    42     1     2     4     1     2     1     1    26     1 
##  5300  6200  6300  6500  7200  8000  8200  9100 10000 10800 11000 11100 
##     2     1     1     1     2     1     1     2     1     1     1     1 
## 12000 12200 14500 15000 15600 28000  <NA> 
##     1     1     1     1     1     1     0

Next let’s look at the StreetName values. Here we can expect a lot more unique values and a lot more typing mistakes.

table(crime_df$StreetName) %>% sort(decreasing = TRUE) %>% head(20) # don't print too much
## 
##        E MARKET ST          W MAIN ST         EMMET ST N 
##               1786               1332               1127 
##          E MAIN ST JEFFERSON PARK AVE          5TH ST SW 
##                756                714                563 
##        PRESTON AVE           RIDGE ST         CHERRY AVE 
##                560                547                527 
##       PROSPECT AVE         GARRETT ST         14TH ST NW 
##                502                480                479 
##     UNIVERSITY AVE            PARK ST        WERTLAND ST 
##                469                383                363 
##        CARLTON AVE         E WATER ST           1ST ST S 
##                352                338                336 
##           HARDY DR         W WATER ST 
##                328                309
length(table(crime_df$StreetName)) # 1704
## [1] 1691

We see the most popular streets for crime reports are “E MARKET ST”, “MAIN ST” (both W and E) and “EMMET ST N”. This is not surprising, since Market St and Main St are the two major east-west roads through downtown and Emmet St is the major north-south route.

With almost 1,700 unique street names, of course we see obvious typos like “W MAINS T” and “WEST MAIN”. Manually adjusting all of these would be tedious, and these mistaken street names represent a small fraction of the total cases, so we will leave them for now. Leaving them in also makes it easy to merge the geocoded data back in later on.
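If you did want to patch the most glaring of these, a quick dplyr::recode would do it. Here is a minimal sketch, assuming the two misspellings above really do mean “W MAIN ST” (I have not applied this for the rest of the post):

# sketch: manually patch a couple of obvious StreetName typos (assumed corrections)
crime_df %<>% mutate(StreetName = recode(StreetName,
                                         "W MAINS T" = "W MAIN ST",
                                         "WEST MAIN" = "W MAIN ST"))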

The final step in our geocode prep is to append the correct “City, State”. Since almost every American town has a Main St, we need to be specific that we only care about the Main St in Charlottesville, Virginia. Now all that’s left to do is paste everything together and let Google handle the heavy lifting.

crime_df %<>% mutate(address = paste(BlockNumber, StreetName, "Charlottesville VA"))

Google Maps API

Google Maps is the best, and of course Google offers some really nice APIs to work with it. This makes Google Maps attractive to companies, and because of this Google has established tiered (read $$$) access levels to these utilities. Any individual can submit 2,500 requests to the API per day for free, which is really nice. Beyond that a billing account is required and a nominal fee of $0.50 per 1,000 requests is charged. That means if we wanted to run a query for every report in the dataset we would have to wait 13 days or pay ~ $15. The fifteen dollars doesn’t sound too bad, but if we filter down to unique addresses only we can cut our costs drastically.
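For the curious, here is the back-of-the-envelope math behind those numbers, using the 31,885 reports we imported above:

# free tier: 2,500 requests/day; paid tier: $0.50 per 1,000 requests
n_reports <- nrow(crime_df)  # 31,885 crime reports
ceiling(n_reports / 2500)    # 13 days of free querying
n_reports / 1000 * 0.50      # $15.94, the ~ $15 quoted above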

# check for unique addresses
address_df <- data.frame(address = unique(crime_df$address))
nrow(address_df) # 3144
## [1] 3159

Now we are down to a more palatable 3,144 unique locations, which we can break up into just two days’ worth of API querying and still keep it free. There is a free alternative to Google Maps called the Data Science Toolkit (DSK). The DSK does not impose query limits, but I think Google does a better job. You can specify which source to use when running ggmap::geocode with the source argument; Google is the default and the one I will be using here.

# split the addresses into two halves to stay under the daily limit
address_list <- split(address_df,
                      rep(c(TRUE, FALSE), length.out = nrow(address_df)))

# geocode one half per day
res1a <- geocode(address_list[[1]]$address, source = "google", output = "all")
res2a <- geocode(address_list[[2]]$address, source = "google", output = "all")

# reload the saved results here so rebuilding the post doesn't re-query the API
res1a <- readRDS("~/future/cville_crime/res1a.RDS")
res2a <- readRDS("~/future/cville_crime/res2a.RDS")
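If you would rather skip Google’s quota entirely, swapping the source argument over to the DSK should be all it takes. This is an untested sketch of that call, not something I ran for this post:

# sketch: the same addresses through the free DSK source (default output = "latlon")
res1a_dsk <- geocode(address_list[[1]]$address, source = "dsk")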

The argument output defaults to “latlon”, which nicely returns just the latitude and longitude coordinates. Since I wanted to be sure the geocoding was behaving itself, I opted for the “all” option, which returns a nested JSON-style list with a lot more information, including the formatted address actually used by Google for each query and the status of the query. The status value is useful for checking how many addresses were successfully coded.

# count the queries that came back with status "OK"
map_lgl(res1a, ~ .["status"] == "OK") %>% sum(na.rm = TRUE) # 1550 / 1572
map_lgl(res2a, ~ .["status"] == "OK") %>% sum(na.rm = TRUE) # 1551 / 1572

Because I chose to pull all of this extra data, I need to do a little work extracting it, so I wrote a helper function to parse the returned JSON objects. If you want to create your own parser function, practice on a single element with le <- res1a[[1]] until you have what you want.

extractor <- function(le) {
    # successful queries return a two element list: "results" and "status"
    if (length(le) == 2 && le$status == "OK") { # so we skip status: ZERO_RESULTS
        res <- le$results %>% unlist() %>% bind_rows() %>%
            select(lat = geometry.location.lat,
                   lon = geometry.location.lng,
                   formatted_address,
                   geometry_loc_type = geometry.location_type)
    } else {
        res <- tibble(formatted_address = NA) # leave a place holder for misses
    }
    return(unique(res))
}

res1a_parsed <- map_df(res1a, extractor, .id = "query")
res2a_parsed <- map_df(res2a, extractor, .id = "query")

Now that I have all of those long, nasty JSON objects cleaned up into tidy tibbles, all I need to do is bring the two groups back together and merge them with the original address_df we used to populate the geocoding queries.

res <- bind_rows(res1a_parsed, res2a_parsed) %>% 
    full_join(address_df, ., by = c("address" = "query")) # on Github

This final table with all of the unique address geocodes is available on [my GitHub here](https://github.com/NathanCDay/cville_crime/blob/master/addresses_geocode.csv). I plan to periodically update it as new versions of the crime data become available, but I will check here first so I don’t needlessly rerun queries. Eventually this may turn into an offline geocoder for Charlottesville.
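If you just want that table, something like this should pull it straight into R. This is a sketch that assumes the file stays at that path; the raw URL is just the standard GitHub transformation of the link above:

# read the pre-geocoded address table from GitHub (raw URL assumed from the repo link)
res <- read_csv("https://raw.githubusercontent.com/NathanCDay/cville_crime/master/addresses_geocode.csv")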

Because I like making things easy, I also have a fully geocoded version of the Crime dataset we downloaded from the ODP available on my GitHub too. But if you want to make your own, just use inner_join().

crime <- inner_join(crime_df, res) # also on Github

If you notice that my GitHub repository is out of sync with the ODP (i.e. my repo was last updated prior to the most recent Crime dataset update), you will want to check against the unique address table (perhaps using anti_join or full_join as a first step instead of inner_join) and geocode any new addresses that fail to match. Also feel free to open an issue on my GitHub repo and I will make adjustments for the updated data.
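As a first pass, an anti_join on the address column will surface anything that still needs a trip through the geocoder. A rough sketch:

# sketch: find addresses in the current crime data that have no geocode yet
new_addresses <- crime_df %>%
    distinct(address) %>%
    anti_join(res, by = "address")
# geocode just these, then bind the results onto res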
