Scraping hidden content from a JavaScript webpage with Python

Question

I'm trying to scrape the content from the following website:

https://mobile.admiral.at/en/event/event/all#/event/15a822ab-84a1-e511-90a2-000c297013a7

I have previously scraped the content successfully using dryscrape and the following code:

import dryscrape
import webkit_server
from lxml import html

session = dryscrape.Session()
session.set_timeout(20)
session.set_attribute('auto_load_images', False)
session.visit('https://mobile.admiral.at/en/event/event/all#/event/15a822ab-84a1-e511-90a2-000c297013a7')
response = session.body()
tree = html.fromstring(response)

print(tree.xpath('(//td[@class="team-name"]/text())[1]'))

The example above would print the home team's name (in this case 'France').

It seems that the structure of the source has been changed, so I'm unable to scrape the contents properly.

What confuses me is that I can see the tags in the Firefox Inspector tool, but they do not appear in the response when I pull the source.

I assume they must have hidden the content somehow to make it impossible (?) to scrape the data.

Could someone please point me in the right direction on how to scrape the content properly?

Curro · Accepted Answer

The content you need is loaded via Ajax using jQuery. I don't know whether dryscrape has been updated lately, but the last time I used it, it didn't support content loaded through jQuery Ajax calls.

Anyway, if you take a look at Chrome's network inspector, you will see that the main content is loaded from an API. You can call that API directly and get a nice JSON response with all the data on the page:

import requests
data = requests.get('https://mobile.admiral.at/;apiVer=json;api=main;jsonType=object;apiRw=1/en/api/event/get-event?id=15a822ab-84a1-e511-90a2-000c297013a7').json()
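If you need this for more than one event, the call above can be wrapped in a few helpers. This is only a sketch under some assumptions: the helper names (event_id_from_page_url, build_api_url, fetch_event, iter_key_paths) are made up for illustration, the endpoint is copied verbatim from the request above, and I am assuming the event ID is always the part of the page URL after '#/event/'. The structure of the returned JSON is not documented here, so the last helper just lists every key path so you can look for the team names yourself.

```python
from urllib.parse import urlsplit

# Endpoint copied verbatim from the request above; the ';apiVer=...' segment
# is kept as-is because its meaning is not documented here.
API_ENDPOINT = (
    "https://mobile.admiral.at/"
    ";apiVer=json;api=main;jsonType=object;apiRw=1"
    "/en/api/event/get-event"
)


def event_id_from_page_url(page_url):
    """Return the event ID from a page URL of the form '...#/event/<id>' (assumed layout)."""
    fragment = urlsplit(page_url).fragment  # e.g. '/event/15a822ab-...'
    prefix = "/event/"
    if not fragment.startswith(prefix):
        raise ValueError(f"no event ID in URL fragment: {fragment!r}")
    return fragment[len(prefix):]


def build_api_url(event_id):
    """Build the same API URL as in the answer for a given event ID."""
    return f"{API_ENDPOINT}?id={event_id}"


def fetch_event(event_id, timeout=20):
    """Call the JSON API directly and return the decoded response."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.get(build_api_url(event_id), timeout=timeout)
    response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
    return response.json()


def iter_key_paths(node, path=""):
    """Yield (path, value) for every leaf in nested JSON, to help locate fields."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_key_paths(value, f"{path}.{key}" if path else str(key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_key_paths(value, f"{path}[{index}]")
    else:
        yield path, node
```

You would then call fetch_event(event_id_from_page_url(page_url)) and loop over iter_key_paths(data) to print each path with its value. Note that the part of the page URL after '#' is a fragment that only the browser-side JavaScript uses, which is why the ID has to be passed to the API explicitly.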
