Twitter, Regular Expressions, and ggmap

Goal: I want to plot a sample of bicycle accidents and find out the most dangerous intersections.

Solution: There is a Twitter handle, @struckDC, that tweets bicycle and pedestrian accidents. I will need to

  1. extract the tweets into a data frame
  2. pull the addresses out of the messy tweets
  3. geocode the addresses
  4. plot the data

1. Extract Tweets Into a Data Frame


# load packages
library(twitteR)
library(plyr)  # for rbind.fill()

# set up keys
api_key <- "INSERT HERE"
api_secret <- "INSERT HERE"
access_token <- "INSERT HERE"
access_token_secret <- "INSERT HERE"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

# pull 1800 tweets from the struckDC handle
df <- getDataByHandle(handle = c("struckdc"), number = 1800)

I wrote the function below to pull the tweets. It is not the prettiest, but it works. You can even pass a vector of handles to the handle parameter and bring in data from multiple accounts. The function is limited in that it does not handle Twitter's API rate limit.

getDataByHandle <- function(handle, number) {
    tweets.df <- data.frame()
    for (i in seq_along(handle)) {
        tryCatch({
            tweets <- userTimeline(handle[i], n = number)
            tweets.df2 <- twListToDF(tweets)
            tweets.df <- rbind.fill(tweets.df2, tweets.df)
        }, error = function(e) { cat("ERROR:", conditionMessage(e), "\n") })
    }
    tweets.df
}
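Since the function above ignores the rate limit, one possible guard is to check twitteR's rate-limit info before each pull. This is a sketch, not the original code; the endpoint name and the 15-minute window are Twitter API details, not something from this post:

```r
library(twitteR)

# check how many user-timeline calls remain in the current window
rl <- getCurRateLimitInfo("statuses")
left <- as.numeric(rl[rl$resource == "/statuses/user_timeline", "remaining"])
if (length(left) && left == 0) {
  Sys.sleep(15 * 60)  # wait out Twitter's 15-minute rate-limit window
}
```

Dropping a check like this into the loop before each userTimeline() call would let long pulls survive the limit instead of erroring out.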

2. Pull Out Addresses From Messy Tweets
Using Regular Expressions

On Twitter there is no order; there is only chaos. For a data scientist, the more order, the easier the analysis. Regular expressions let us extract whatever piece of text we desire. The code below takes a surgical approach, removing the miscellaneous clutter in the tweets to try to create some order. Ideally you might build some kind of machine-learning model to do this, but this is just a blog.

# finds the word "update" at the beginning of the string and blanks that tweet;
# otherwise it keeps the lowercased tweet. Two birds with one stone.
df$address <- gsub("^update.*", "", tolower(df$text))

#finds any hashtags and removes the tag until the first white-space
df$address <- gsub("\\#\\S*","",df$address)

#removes everything inside parenthesis and the parenthesis themselves 
df$address <- gsub("\\(.*\\)","",df$address)

#removes everything with an @ and everything after until the first white-space
df$address <- gsub("\\@\\S*","",df$address)

# removes links that start with http
df$address <- gsub("http\\S*", "", df$address)

# removes all punctuation 
df$address <- gsub("[[:punct:]]","",df$address)

#Here is a list of all words I want to remove
list <- c("pedestrian","struck","cyclist","mt","amp:","&amp;","at","fatality","ped","police","hit" ,"adult", "male","ambulance","bicyclist","report","driver","hits","crash","near","killed","pedestrians")

## Now I paste these words into a single pattern, concatenated by a pipe
pat <- paste0("\\b(", paste0(list, collapse="|"), ")\\b") 

##Now I remove the words
df$address <- gsub(pat,"",df$address)
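To see the pipeline in action, here is a walk-through on a made-up tweet (the street names, hashtag, and link are invented for illustration, and the word list is abbreviated):

```r
tweet <- "Pedestrian struck at 7th st and h st nw (minor injuries) #bikedc @struckdc https://t.co/xyz"

x <- tolower(tweet)
x <- gsub("^update.*", "", x)    # drop follow-up "update" tweets entirely
x <- gsub("#\\S*", "", x)        # hashtags
x <- gsub("\\(.*\\)", "", x)     # parentheticals
x <- gsub("@\\S*", "", x)        # mentions
x <- gsub("http\\S*", "", x)     # links
x <- gsub("[[:punct:]]", "", x)  # punctuation
pat <- paste0("\\b(", paste0(c("pedestrian", "struck", "at"), collapse = "|"), ")\\b")
x <- gsub(pat, "", x)            # words from the removal list
trimws(gsub("\\s+", " ", x))     # collapse the leftover whitespace
# leaves "7th st and h st nw"
```

The whitespace-collapsing step at the end is an extra touch not in the pipeline above, but it makes the surviving address easier to eyeball.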

The code above got me part of the way there; it cleaned up about 50% of the tweets. I didn’t want to spend the day writing the perfect code, so I cleaned the rest manually, though you could geocode from here without further cleaning.

3. Geocode Addresses
Using Google’s API to geocode addresses


# create a string of the address with the city appended to get better results
df$string <- paste(df$address, "Washington D.C.")

# if you knew there would be no errors, you could simply do:
# df2 <- cbind(df, geocode(df$string, output = "more"))
# Instead I loop through the data so I can skip over the errors; since we are
# working with messy data, some locations will fail to geocode.

new <- data.frame()
for (i in seq_along(df$string)) {
  tryCatch({
    df2 <- geocode(df$string[i], output = "more")
    df2 <- cbind(df2, df[i, ])
    new <- rbind.fill(df2, new)
  }, error = function(e) { cat("ERROR:", conditionMessage(e), "\n") })
}
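Before plotting, it is worth dropping the rows where geocoding failed. A minimal sketch (the lon/lat column names come from ggmap's geocode output; the sanity-check message is my addition):

```r
# rows where geocode() failed come back with NA coordinates
geocoded <- new[!is.na(new$lon) & !is.na(new$lat), ]
cat("geocoded", nrow(geocoded), "of", nrow(new), "tweets\n")
```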

4. Plot Data
Using ggmap to plot data


# geocode the place you want to map
DC <- as.numeric(geocode("Washington D.C."))

# create a map of the place; change the zoom as appropriate
# ("toner" is a Stamen map type, so set source = "stamen")
dcMap <- qmap(DC, zoom = 13, source = "stamen", maptype = "toner", scale = 2)

# create the map: points plus a 2d-binned heat layer
# (new2 is the manually cleaned version of the geocoded data)
dcMap +
  geom_point(data = new2, aes(x = lon, y = lat), colour = "black", size = 1.5) +
  stat_bin2d(aes(x = lon, y = lat),
             size = 100, bins = 30, alpha = 0.5,
             data = new2) +
  scale_fill_gradient(low = "lightyellow", high = "red")
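To get back to the original goal of finding the most dangerous intersections, one rough approach is to round the coordinates so nearby points fall into the same bucket, then count per bucket. A sketch in base R; the three-decimal rounding (roughly a city block) is my assumption, not something tuned in the original analysis:

```r
# round coordinates so points at the same intersection collapse together
new2$bucket <- paste(round(new2$lat, 3), round(new2$lon, 3))
counts <- sort(table(new2$bucket), decreasing = TRUE)
head(counts, 10)  # the ten most frequent crash locations
```

Joining the top buckets back to the cleaned address strings would give you the intersection names behind the counts.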