Temporal vs Spatial – Visualization Design

THE VIZ

Temporal Designs: these convey information over time. Think of a PowerPoint deck, where the presenter clicks through each slide to reveal the next piece of information, or a filter on a dashboard, where the user clicks to see the information in a more granular view. These designs have merit: they can pack a lot of information into the entirety of the document. But they have a flaw: the human brain has a hard time comprehending differences when information is displayed over time.

A day trader does not have one monitor; she does not flip through different views of stocks and indicators to make a quick decision. She has six monitors, with each view displayed in space rather than at disparate times. This allows her to see patterns quickly.

An air traffic controller has many monitors because she has to understand the data rapidly to make correct decisions.

Spatial Designs: these allow the user to process data quickly. We have evolved to see patterns, and our brains are good at it, but they are much better when we can see it all at once.

Good visualization design will thus promote our ability to see data in space as much as possible; it will keep the user from having to figure out the pattern by flipping back and forth like some pattern-blind Australopithecus afarensis who didn't have the luxury of evolving superior pattern processing for the past 2 million years.

Take a look at the recent viz I made.

  • Does it allow for quick pattern recognition?
  • Can you see things in the data quickly because of its spatial design?

Research

Twitter, Regular Expressions, and ggmap

Goal: I want to plot a sample of bicycle accidents and find out the most dangerous intersections.

Solution: There is a Twitter handle called struckDC that tweets bicycle and pedestrian accidents. I will need to:

  1. Extract the tweets into a dataframe
  2. Pull the addresses out of the messy tweets
  3. Geocode the addresses
  4. Plot the data

1. Extract Tweets Into Dataframe

library(devtools)
library(twitteR)
library(ROAuth)
source("twitterFunction.r")

#set up  keys
api_key <- "INSERT HERE"
api_secret <- "INSERT HERE"
access_token <- "INSERT HERE"
access_token_secret <- "INSERT HERE"
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)

# pulling 1800 tweets from struckDC handle
df <- getDataByHandle(handle = c("struckdc"),number=1800 )

I wrote this function to pull the tweets. It is not the prettiest, but it worked. You can even make the handle parameter a vector and bring in data from multiple handles. The function is limited in that it does not address Twitter's API rate limit.


library(plyr)   # rbind.fill comes from plyr

getDataByHandle <-
  function(handle, number = 100) {
    tweets.df <- data.frame()
    for (i in seq_along(handle)) {
      tryCatch({
        # pull up to `number` tweets for this handle and stack them on
        tweets <- userTimeline(handle[i], n = number)
        tweets.df2 <- twListToDF(tweets)
        tweets.df <- rbind.fill(tweets.df2, tweets.df)
      }, error = function(e) {cat("ERROR :", conditionMessage(e), "\n")})
    }
    return(tweets.df)
  }

2. Pull Out Addresses From Messy Tweets
Using Regular Expressions

On Twitter there is no order, only chaos. For a data scientist, the more order, the easier the analysis. Regular expressions can help us extract whatever piece of text our minds desire. The code below takes a surgical approach, removing all the miscellaneous things in the tweets to try to create some order. Ideally it might be best to build some type of machine learning model to do this, but this is just a blog post.



# finds the word "update" at the beginning of the string and blanks out that tweet;
# otherwise it keeps the lowercased tweet (three steps in one)
df$address <- gsub("^update.*", "", tolower(df$text))

# finds any hashtag and removes it up to the first whitespace
df$address <- gsub("#\\S*", "", df$address)

# removes everything inside parentheses, and the parentheses themselves
df$address <- gsub("\\(.*\\)", "", df$address)

# removes an @ and everything after it until the first whitespace
df$address <- gsub("@\\S*", "", df$address)

# removes links that start with http
df$address <- gsub("http\\S*", "", df$address)

# removes all punctuation
df$address <- gsub("[[:punct:]]", "", df$address)

#Here is a list of all words I want to remove
list <- c("pedestrian","struck","cyclist","mt","amp:","&amp;","at","fatality","ped","police","hit" ,"adult", "male","ambulance","bicyclist","report","driver","hits","crash","near","killed","pedestrians")

## Now I paste these words into a single string, separated by pipes
pat <- paste0("\\b(", paste0(list, collapse="|"), ")\\b") 

##Now I remove the words
df$address <- gsub(pat,"",df$address)

The code above got me part of the way there; it cleaned up about 50% of the tweets. I didn't want to spend the day writing the perfect code, so I cleaned the rest manually, but you could geocode from here without cleaning them further.

3. Geocode Addresses
Using Google's API to geocode the addresses

library(ggmap)
library(plyr)    # for rbind.fill

# create a single string of the address with the city appended to get better results
df$string <- paste(df$address, "Washington D.C.")

# if you knew there would be no errors you could do it in one shot:
#   df2 <- cbind(df, geocode(df$string, output = "more"))
# Because we are working with messy data, some locations will fail to geocode,
# so I loop through the rows and skip over the errors instead.

new <- data.frame()
for (i in  seq_along(df$string)){
  tryCatch({
  df2 <- geocode(df$string[i],output="more")
  df2 <- cbind(df2,df[i,])
  new <- rbind.fill(df2,new)
  }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")}) 
}

4. Plot Data
Using ggmap to plot the data

library(MASS)
library(RgoogleMaps)
library(RColorBrewer)

# geocode the place you want to map
DC <- as.numeric(geocode("Washington D.C."))

# create a base map of the place; change the zoom as appropriate
dcMap <- qmap(DC, zoom = 13, maptype = "toner", scale = 2)

# plot each accident as a point and overlay a 2D-binned heat layer
dcMap + geom_point(data = new, aes(x = lon, y = lat), colour = 'black', size = 1.5) +
  stat_bin2d(
    aes(x = lon, y = lat),
    size = 100, bins = 30, alpha = 0.5,
    data = new
  ) + scale_fill_gradient(low = "lightyellow", high = "red")
  

aggregating by census tract

See it on Tableau

the goal:
I want to find the median value of lead samples in Parts Per Billion by census tract. To do that I will need to find the census tract associated with each latitude and longitude and then find the median value of all observations within that tract.

the data:
I have a dataframe that looks like this:

lon lat PPB
-77.034018 38.9138584 0

the code:

step 1.

I will use the FCC's API: I can send it a latitude and longitude and it will return the census block (among other things). You really only need a few lines of code, but below I wrote a class so the user can pull out whichever pieces of data they want. The class is initialized with two parameters, latitude and longitude, which get inserted into the URL payload courtesy of the requests module. The response object r is created and its JSON is extracted into a new object y. The methods block, county, state, and intersection each pull a specific element out of y, and data() returns the whole JSON response.

import requests
import json

class censusData:

    def __init__(self, lat, lon, showall=True):
        # query the FCC block-find API for the given coordinates
        url = 'http://data.fcc.gov/api/block/find?format=json'
        payload = {'latitude': lat, 'longitude': lon, 'showall': showall}
        self.r = requests.get(url, params=payload)
        self.y = self.r.json()

    def block(self):
        # the 15-digit census block FIPS code
        return str(self.y['Block']['FIPS'])

    def county(self):
        return str(self.y['County']['name'])

    def state(self):
        return str(self.y['State']['name'])

    def intersection(self):
        # keep only the digits of each entry under Block['intersection']
        records = []
        for b in self.y['Block']['intersection']:
            record = filter(lambda x: x.isdigit(), str(b))
            records.append(record)
        return records

    def data(self):
        # the full JSON response as a string
        return json.dumps(self.y)

step 2: using the class
I can now call the API like so:

censusData(28.35975,-81.421988).block()

But how can we apply it to a large dataframe?
Hint: use apply.

import pandas as pd
from census import censusData   # the censusData class defined above

# bring in the data
df = pd.read_csv("data.csv")

'''
This is when apply is your greatest friend.
I am applying my function to each row of the data frame.
No need for a messy for loop.
'''
# block() returns the 15-digit block FIPS; the tract GEOID is its first 11 digits
df['Tract'] = df.apply(lambda x: censusData(x['lat'], x['lon']).block(), axis=1)

Now I have a dataframe that looks like this:

lon lat PPB Tract
-77.034018 38.9138584 0 110010081002007
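
With the block FIPS attached to every sample, the last step of the stated goal (the median PPB per tract) is a short groupby. This is just a sketch of that step, not code from the original pipeline; the column name tract11 is one I made up to hold the truncated code.

# the 15-digit block FIPS contains the 11-digit tract GEOID as its prefix
df['tract11'] = df['Tract'].astype(str).str[:11]

# median lead level in parts per billion for each census tract
tract_medians = df.groupby('tract11')['PPB'].median().reset_index()
print(tract_medians.head())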

web scraping aircraft crashes

See it on Tableau

the goal:
I want to visualize all plane crashes that led to human fatalities.

the data:

The best data I could find was at https://aviation-safety.net/.
The data is not in a very consumable structure; it is posted as static HTML tables throughout the website. I will need to iterate through every unique aircraft accident on the website, extract the data, and put it into a structure that will help me visualize it. This is a perfect job for Python!

the code:


from bs4 import BeautifulSoup as bs
import urllib2
import pandas as pd
import requests

def getPlaneData(yearStart, yearEnd):
    fs = pd.DataFrame()
    # one pass per year in the range, inclusive
    for year in range(yearStart, yearEnd + 1):
        lista = []
        # the yearly index is paginated; grab the first two pages of links
        for page in range(1, 3):
            firstLink = ('http://www.aviation-safety.net/database/dblist.php?Year='
                         + str(year) + '&lang=&page=' + str(page))
            r = requests.get(firstLink)
            soup = bs(r.text)
            for link in soup.find_all('a', href=True):
                lista.append(link['href'])
        # keep only the links that point at individual accident records
        u = [href for href in lista if href.startswith('/database/r')]
        content = list(set(u))
        # main loop: fetch each accident page and extract the table it contains
        for a in content:
            link = 'http://www.aviation-safety.net' + a
            req = urllib2.Request(link)
            req.add_unredirected_header('User-Agent', 'Custom User-Agent')
            html2 = urllib2.urlopen(req).read()
            table = bs(html2)
            try:
                tab = table.find_all('table')[0]
                records = []
                for tr in tab.findAll('tr'):
                    trs = tr.findAll('td')
                    if len(trs) < 2:
                        # skip rows that are not a label/value pair
                        continue
                    records.append([trs[0].text, trs[1].text])
                df = pd.DataFrame(data=records)
                # the first column holds the field names; pivot so each
                # accident becomes a single row
                df.set_index(df[0], inplace=True)
                df = pd.DataFrame(df.ix[:, 1])
                df = pd.DataFrame.transpose(df)
                fs = fs.append(df)
            except:
                # skip pages without a parsable data table
                continue
    return pd.DataFrame(fs)

an aside about the code
The function takes two parameters, yearStart and yearEnd. It iterates over each year in that range and extracts the HTML of that year's index pages. It then walks through the HTML, pulls out every link that starts with "/database/r", and builds a list called content of all the pages that hold accident data. The main loop runs through each item in content and extracts the first data table on each page: tab = table.find_all('table')[0]
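
For reference, here is a minimal usage sketch; the year range and file name are just examples, not values from the original analysis.

# scrape every accident page for 2000 through 2005 (example range)
crashes = getPlaneData(2000, 2005)

# save a flat file that can be pulled straight into Tableau
crashes.to_csv("crashes.csv", index=False, encoding="utf-8")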

the viz:
here

I want to visualize all the data in one chart. I thought a sorted stacked bar graph was the best approach: I sorted first by the category of crash and then by the number of fatalities. This lets the reader compare fatalities by category first and then quickly find the largest plane crash in any given year. It shows overall trends but still allows for granular understanding, all in one chart.
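
The sorting itself happened in Tableau, but if you wanted to pre-sort the scraped data in pandas before exporting, a rough sketch could look like this; the column names Year, Category, and Fatalities are assumptions about a cleaned-up version of the scraped table, not fields the scraper produces verbatim.

import pandas as pd

# hypothetical cleaned export of the scraped data; "Year", "Category", and
# "Fatalities" are assumed column names
crashes = pd.read_csv("crashes_clean.csv")

# order the rows the way the bars are sorted: by year, then crash category,
# then fatalities descending within each category
ordered = crashes.sort_values(by=["Year", "Category", "Fatalities"],
                              ascending=[True, True, False])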