Extracting website data with BeautifulSoup

Question

I'm trying to extract timetable data from this site. The content is contained in a div with the class departures-table. I want to skip the first two rows and store the rest of the data in an array, but it's not working. I'm obviously making a mistake but can't figure out what it is. Thanks.

    snav_live_departures_url = "https://www.snav.it/"
    headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
    request = urllib.request.Request(snav_live_departures_url,headers=headers)
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html,'html.parser')
    snav_live_departures = []
    snav_live_departures_table = list(soup.select('.departures-table div')) [2:]
    print(snav_live_departures_table)
    for div in snav_live_departures_table:
        div = div.select('departures-row')
        snav_live_departures.append({
            'TIME':div[4].text,
            'DEPARTURE HARBOUR':div[0].text,
            'ARRIVAL HARBOUR':div[1].text,
            'STATUS':td[3].select('span.tt-text')[0].text,
            'PURCHASE LINK':div[6].select('a')[0].attrs['href']
        })

Prayson W. Daniel · Accepted Answer

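As mentioned, when dealing with JavaScript-heavy pages like this one, it helps to watch the Network tab in your browser's developer tools to see how the data is loaded. The table is filled in by a script after the page loads, so it is not in the HTML that BeautifulSoup receives.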

This code requests the data directly from the endpoint the page calls and parses it into a dictionary that you can process as you want:

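    import requests
    import json

    URL = 'https://booking.snav.it/api/v1/dashboard/nextDepartures?callback=jQuery12345&_=12345'

    r = requests.get(URL)
    s = r.content.decode('utf-8')
    # The response is JSONP, not plain JSON: drop the first 16 characters
    # (the callback wrapper) and the last 2 characters before parsing.
    data = json.loads(s[16:len(s)-2])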
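
The fixed slice `s[16:len(s)-2]` depends on the exact length of the callback wrapper, so it would stop working if the callback name or the wrapper changed. Below is a minimal sketch of a more tolerant way to unwrap the response, assuming it has the usual JSONP shape `callback(...)`. The helper name `parse_jsonp` and the sample payload are made up for illustration; the real structure of the API's data is not shown here.

```python
import json
import re


def parse_jsonp(text):
    """Return the JSON payload wrapped inside a JSONP response (hypothetical helper)."""
    # Skip anything before the callback name (for example a leading comment),
    # then capture everything between the outermost parentheses.
    match = re.search(r"[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Made-up sample response, used only to demonstrate the helper.
sample = '/**/jQuery12345({"departures": [{"time": "08:30"}]});'
data = parse_jsonp(sample)
print(data["departures"][0]["time"])
```

With the real request, you would call `parse_jsonp(r.text)` in place of the slice and then inspect `data` to see which keys it actually contains.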
