jonleech

Reputation: 461

Scraping with Beautiful Soup

I stumbled across an excellent post on scraping with Beautiful Soup and decided to try scraping some data off the internet myself.

I am using the flight data from Flight Radar 24 and applying what the blog post describes to automate scraping through the pages of flight data.

import requests
import bs4

root_url = 'http://www.flightradar24.com'
index_url = root_url + '/data/flights/tigerair-tgw/'


def get_flight_id_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # Quote the attribute value: an unquoted /data is not a valid CSS selector
    return [a.attrs.get('href') for a in soup.select('div.list-group a[href^="/data"]')]


flight_id_urls = get_flight_id_urls()
for flight_id_url in flight_id_urls:
    temp_url = root_url + flight_id_url
    response = requests.get(temp_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

    # Parse the table inside the loop so every flight page is processed
    try:
        table = soup.find('table')
        rows = table.find_all('tr')
        for row in rows:
            flight_data = {}
            flight_data['title'] = soup.select('div#cntPagePreTitle h1')[0].get_text()
            flight_data['tr'] = row #error here
            print (flight_data)

    except AttributeError as e:
        raise ValueError("No valid table found")

sample of a flight data page

I got as far as the table and then realized I don't know how to traverse its rows to get the data embedded in each column.

Does anyone have any pointers, or even introductory tutorials, that I can read to learn how to extract the data?

P.S.: Credit to Miguel Grinberg for the excellent tutorial.

Added

try:
    table = soup.find('table')
    rows = table.find_all('tr')
    heads = [i.text.strip() for i in table.select('thead th')]
    for tr in table.select('tbody tr'):
        flight_data = {}
        flight_data['title'] = soup.select('div#cntPagePreTitle h1')[0].get_text()
        flight_data['From'] = tr.select('td.From')
        flight_data['To'] = tr.select('td.To')

        print (flight_data)

except AttributeError as e:
    raise ValueError("No valid table found")

I changed the last part of my code to build a data object, but I can't seem to get the data.

The final edit:

import requests
import bs4

root_url = 'http://www.flightradar24.com'
index_url = root_url + '/data/flights/tigerair-tgw/'


def get_flight_id_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # Quote the attribute value: an unquoted /data is not a valid CSS selector
    return [a.attrs.get('href') for a in soup.select('div.list-group a[href^="/data"]')]


flight_id_urls = get_flight_id_urls()
for flight_id_url in flight_id_urls:
    temp_url = root_url + flight_id_url
    response = requests.get(temp_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

    # Parse the table inside the loop so every flight page is processed
    try:
        table = soup.find('table')
        # Only body rows carry the data-* attributes; header rows do not
        for tr in table.select('tbody tr'):
            flight_data = {}
            flight_data['flight_number'] = tr['data-flight-number']
            flight_data['from'] = tr['data-name-from']
            print (flight_data)

    except AttributeError as e:
        raise ValueError("No valid table found")

P.P.S.: All thanks to @amow for his great help :D

Upvotes: 0

Views: 1256

Answers (1)

amow

Reputation: 2223

Start with table set to the table element parsed from your HTML.

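# Column names come from the header cells
heads = [i.text.strip() for i in table.select('thead th')]
for tr in table.select('tbody tr'):
    # Cell texts of one body row, in the same order as the headers
    datas = [i.text.strip() for i in tr.select('td')]
    print(dict(zip(heads, datas)))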

Output

{   
    u'STD': u'06:30',   
    u'Status': u'Scheduled',   
    u'ATD': u'-',  
    u'From': u'Singapore  (SIN)',  
    u'STA': u'07:55',  
    u'\xa0': u'', # This is the last column and has no meaning
    u'To': u'Penang  (PEN)',  
    u'Aircraft': u'-',  
    u'Date': u'2015-04-19'
}
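To try the header/cell pairing without hitting the site, here is a minimal sketch that runs against an inline table. The markup in SAMPLE_HTML and the helper name table_to_dicts are made up for illustration; the real page has more columns and may be structured differently.

```python
import bs4

# Hypothetical sample markup, invented for illustration only.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Date</th><th>From</th><th>To</th></tr>
  </thead>
  <tbody>
    <tr><td>2015-04-19</td><td>Singapore (SIN)</td><td>Penang (PEN)</td></tr>
    <tr><td>2015-04-20</td><td>Penang (PEN)</td><td>Singapore (SIN)</td></tr>
  </tbody>
</table>
"""


def table_to_dicts(table):
    """Return one dict per body row, keyed by the header cell texts."""
    heads = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        # zip pairs each header with the cell in the same column position
        rows.append(dict(zip(heads, cells)))
    return rows


soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')
flights = table_to_dicts(soup.find('table'))
for flight in flights:
    print(flight)
```

Note that zip stops at the shorter of the two lists, so a row with fewer cells than headers simply yields fewer keys rather than raising an error.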

If you want the data stored in the attributes of the tr tag, just use

tr['data-data']
tr['data-flight-number']

and so on.
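A small sketch of the attribute access, again on made-up markup: the attribute names follow the ones mentioned in this thread, but the values in SAMPLE_ROW are invented. One thing worth knowing is that tr['...'] raises KeyError (not AttributeError) when the attribute is missing, so tr.get(...) may be safer if some rows lack it.

```python
import bs4

# Hypothetical row; the attribute values are invented for illustration.
SAMPLE_ROW = (
    '<table><tbody>'
    '<tr data-flight-number="TR2428" data-name-from="Singapore">'
    '<td>06:30</td></tr>'
    '</tbody></table>'
)

soup = bs4.BeautifulSoup(SAMPLE_ROW, 'html.parser')
tr = soup.find('tr')

# Indexing a tag looks up an attribute and raises KeyError if it is absent
flight_number = tr['data-flight-number']

# .get() returns None, or the given default, instead of raising
origin = tr.get('data-name-from')
missing = tr.get('data-name-to', 'unknown')

# .attrs is a plain dict, so all data-* attributes can be collected at once
data_attrs = {k: v for k, v in tr.attrs.items() if k.startswith('data-')}

print(flight_number, origin, missing)
print(data_attrs)
```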

Upvotes: 4
