I want to visualize all plane crashes that led to human fatalities.
The best data I could find was at https://aviation-safety.net/
The data is not in a very consumable structure; it is posted as static HTML tables throughout the website. I will need to iterate through every unique aircraft accident on the website, extract the data, and put it into a structure that will help me visualize it. This is a perfect job for Python!
```python
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

HEADERS = {'User-Agent': 'Custom User-Agent'}


def getPlaneData(yearStart, yearEnd):
    frames = []
    for year in range(yearStart, yearEnd + 1):
        # collect every link on the year's index pages
        lista = []
        for page in range(1, 3):
            firstLink = ('http://www.aviation-safety.net/database/dblist.php'
                         '?Year=' + str(year) + '&lang=&page=' + str(page))
            r = requests.get(firstLink, headers=HEADERS)
            soup = bs(r.text, 'html.parser')
            for link in soup.find_all('a', href=True):
                lista.append(link['href'])
        # keep only accident-record links, dropping duplicates
        content = list(set(x for x in lista if x.startswith('/database/r')))

        # main loop through all links just extracted: gets the html content
        # of each link and extracts the data table in each page
        for a in content:
            link = 'http://www.aviation-safety.net' + a
            html2 = requests.get(link, headers=HEADERS).text
            table = bs(html2, 'html.parser')
            try:
                tab = table.find_all('table')
                records = []
                for tr in tab[0].find_all('tr'):
                    th = tr.find('th')   # field name, e.g. "Date:"
                    td = tr.find('td')   # field value
                    if th is not None and td is not None:
                        records.append([th.text, td.text])
                # field names become columns, so each accident is one row
                df = pd.DataFrame(records).set_index(0).T
                frames.append(df)
            except (IndexError, AttributeError):
                continue
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```
An aside about the code
The function takes two parameters, yearStart and yearEnd. For each year in that range, it fetches the HTML of the year's database index pages, iterates through the anchors, and pulls out every link that starts with "/database/r", building a deduplicated list called content of all the pages that hold accident data. The main loop then fetches each item in content and extracts the first data table: tab = table.find_all('table')
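To make the link-filtering step concrete, here is a minimal, self-contained sketch run against a toy index page. The markup and hrefs below are made up for illustration (the real dblist.php pages look different); only the "/database/r" prefix check matters:

```python
from bs4 import BeautifulSoup as bs

# A toy index page standing in for one dblist.php result page.
html = """
<html><body>
  <a href="/database/record.php?id=19770327-0">record A</a>
  <a href="/database/dblist.php?Year=1977">index page</a>
  <a href="/database/record.php?id=19770327-1">record B</a>
</body></html>
"""

soup = bs(html, 'html.parser')
# gather every href on the page
lista = [a['href'] for a in soup.find_all('a', href=True)]
# keep only accident-record pages; set() drops duplicates
content = sorted(set(x for x in lista if x.startswith('/database/r')))
print(content)
# → ['/database/record.php?id=19770327-0', '/database/record.php?id=19770327-1']
```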
I want to visualize all the data in one chart. A sorted stacked bar graph seemed like the best approach. I sorted first by the category of crash, then by the number of fatalities. This lets the reader first compare fatalities by category, then quickly find the largest plane crash in any given year. It shows overall trends while still allowing granular understanding, all in one chart.
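As a rough sketch of that chart, assuming the scraped frame has been cleaned into Year, Category, and Fatalities columns (illustrative names and toy values, not what the scraper produces directly):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; the real frame would come from the scraper after cleaning.
df = pd.DataFrame({
    'Year':       [1977, 1977, 1985, 1985, 1985],
    'Category':   ['A1', 'A2', 'A1', 'A1', 'A2'],
    'Fatalities': [583, 9, 520, 137, 4],
})

# Pivot to years x categories, summing fatalities per cell,
# then order the columns so the largest categories stack first.
pivot = df.pivot_table(index='Year', columns='Category',
                       values='Fatalities', aggfunc='sum', fill_value=0)
pivot = pivot[pivot.sum().sort_values(ascending=False).index]

pivot.plot(kind='bar', stacked=True)
plt.ylabel('Fatalities')
plt.title('Fatalities per year by accident category')
plt.savefig('crashes.png')
```

Sorting within each bar (largest crash on top) would take more work; the column ordering above only sorts the categories overall.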