Daniela

Reputation: 911

Extracting website data with BeautifulSoup

I'm trying to extract timetable data from this site. The content is contained in a div with the class .departures-table. I want to skip the first two rows and store the rest of the data in an array, but it's not working. I'm obviously making a mistake but can't figure out which one. Thanks

    snav_live_departures_url = "https://www.snav.it/"
    headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
    request = urllib.request.Request(snav_live_departures_url,headers=headers)
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html,'html.parser')
    snav_live_departures = []
    snav_live_departures_table = list(soup.select('.departures-table div')) [2:]
    print(snav_live_departures_table)
    for div in snav_live_departures_table:
        div = div.select('departures-row')
        snav_live_departures.append({
            'TIME':div[4].text,
            'DEPARTURE HARBOUR':div[0].text,
            'ARRIVAL HARBOUR':div[1].text,
            'STATUS':td[3].select('span.tt-text')[0].text,
            'PURCHASE LINK':div[6].select('a')[0].attrs['href']
        })

Upvotes: 0

Views: 69

Answers (2)

Prayson W. Daniel

Reputation: 15568

As mentioned, when dealing with JavaScript-heavy pages like this one, it helps to watch the Network tab of your browser's developer tools to see how the data is loaded.

The code below requests the endpoint the page itself calls and parses the response into a dictionary that you can process however you want:

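    import json
    import requests

    URL = 'https://booking.snav.it/api/v1/dashboard/nextDepartures?callback=jQuery12345&_=12345'

    r = requests.get(URL)
    s = r.content.decode('utf-8')
    # The response is JSONP, i.e. JSON wrapped in a callback call such as
    # jQuery12345({...}); keep only the text between the outer parentheses.
    data = json.loads(s[s.index('(') + 1:s.rindex(')')])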
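
The query string in that URL carries two parameters: callback, the name of the JavaScript function the server wraps the JSON in, and _, which jQuery adds as a cache-busting timestamp. A minimal sketch of building the same URL with the standard library follows; build_jsonp_url is a made-up helper, and whether the server accepts any callback name or timestamp is an assumption I have not checked:

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://booking.snav.it/api/v1/dashboard/nextDepartures"


def build_jsonp_url(base_url, callback="jQuery12345", timestamp_ms=None):
    """Append the callback and cache-busting parameters to base_url."""
    if timestamp_ms is None:
        # jQuery uses the current time in milliseconds for the "_" parameter
        timestamp_ms = int(time.time() * 1000)
    query = urlencode({"callback": callback, "_": timestamp_ms})
    return f"{base_url}?{query}"


# Reproduces the URL used above when given the same values
print(build_jsonp_url(BASE_URL, timestamp_ms=12345))
```

Keep in mind that a different callback name changes the length of the wrapper around the JSON, so any fixed-offset slicing of the response would need adjusting.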

Upvotes: 0

mdaniel

Reputation: 33168

There are a few different things going on here:

  1. The HTML does not contain the data you want; it is loaded via a JavaScript callback. You can see this by viewing the page source and by watching the site's API call in the developer tools.
  2. You were actually "lucky" that there was no data in the page; otherwise this code would have failed with a NameError, since td is not in scope:

        'STATUS':td[3].select('span.tt-text')[0].text,
    
  3. It's unclear what you were trying to do with that line, since those children are not <td> elements anyway; they're all <div>s.
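
A quick way to check the first point from a script is to count how many elements with the expected class appear in the raw HTML. This is only a sketch: it uses the standard library's html.parser rather than BeautifulSoup, the class name departures-row comes from the question, and the ClassCounter helper and both sample strings are made up for illustration (they are not the site's real markup):

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical markup: an empty container, as served before any script runs
static_html = '<div class="departures-table"><div class="header"></div></div>'
# Hypothetical markup: the same container after a script has filled it in
rendered_html = (
    '<div class="departures-table">'
    '<div class="departures-row odd"><div>08:00</div></div>'
    '<div class="departures-row"><div>09:30</div></div>'
    '</div>'
)

print(count_class(static_html, "departures-row"))    # 0
print(count_class(rendered_html, "departures-row"))  # 2
```

If the count for the downloaded page is zero, no selector will find the rows, and the data has to come from somewhere else.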

I think you will likely be happiest just mimicking the API call, stripping the JS callback text off of the response, and using the structured data:

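    import json
    import re
    import urllib.request

    # api_url is the endpoint you see the page calling in the developer tools
    with urllib.request.urlopen(api_url) as fh:
        js_text = fh.read().decode('utf-8')
    # Drop the trailing ");" and the leading "callbackName(" to leave plain JSON
    json_text = re.sub(r"^[^(]+\(", "", re.sub(r"\);?\s*$", "", js_text))
    data = json.loads(json_text)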
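
If you want to try the stripping step without hitting the network, here is a small self-contained sketch that does it with a single regular expression. The strip_jsonp name, the pattern, and the sample response are my own illustration; in particular, the field names in the sample are invented, since the real payload is not shown here:

```python
import json
import re

# Matches "callbackName( ... )" with an optional leading /* */ comment, an
# optional trailing semicolon and surrounding whitespace. Group 1 is the payload.
JSONP_RE = re.compile(
    r"^\s*(?:/\*.*?\*/)?\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL
)


def strip_jsonp(text):
    """Return the parsed JSON payload of a JSONP response body."""
    match = JSONP_RE.match(text)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Hypothetical response body, not the real API output
sample = '/**/jQuery12345({"departures": [{"time": "08:00"}]});'
print(strip_jsonp(sample))  # {'departures': [{'time': '08:00'}]}
```

Matching the whole wrapper, rather than slicing at fixed offsets, means the code keeps working if the callback name changes length, and it fails loudly when the response is not JSONP at all.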

Upvotes: 2
