Reputation: 875
I want to scrape data from a webpage with a dynamic table. The table contains information on train rides.
This is the website: https://www.laerm-monitoring.de/zug/?mp=3/
I tried to request the data with a simple mounted request session, but I only got basic HTML data without the data from the table.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504, 429),
    session=None,
):
    # Reuse the given session or create a new one
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    # Apply the retry policy to both HTTP and HTTPS requests
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

session = requests_retry_session()
response = session.get('https://www.laerm-monitoring.de/zug/?mp=3/')
response.content
How can I do this correctly?
Upvotes: 5
Views: 1933
Reputation: 9377
With a simple GET request you can retrieve the HTML of the landing page.
import requests
response = requests.get('https://www.laerm-monitoring.de/zug/') # even without query-parameters: ?mp=3/
print( response.content )
This can also be done in any browser.
In the source view (Ctrl + U on Windows/Linux, Cmd + U on Mac) you will find the token needed for all subsequent requests against the REST API: __RequestVerificationToken.
It's inside a hidden <input> form field on this page:
<input name="__RequestVerificationToken" type="hidden" value="CfDJ8B_eKmsiQC9Esc7ZjyC063dp6MzAtP3Sawnrfz3SCqxOMoPCYMV4sjDbrhDbuOsPcLnOiElgqQWTdMxCgfmhNVx1eC6oR81kZT3os2z3DJxtu6H9V7fKt9z9bdSJwB1ACYSSYWHsmPzt-AMWvSk4eYU" />
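The hidden field can also be read programmatically. Here is a minimal sketch using only the standard library's html.parser; the class name TokenParser, the helper extract_token and the sample HTML are made up for illustration and are not part of the site or any library.

```python
from html.parser import HTMLParser


class TokenParser(HTMLParser):
    """Collects the value of the hidden __RequestVerificationToken input."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs for the current tag
        attr_map = dict(attrs)
        is_token_field = (
            tag == "input"
            and attr_map.get("name") == "__RequestVerificationToken"
        )
        # Keep only the first match in case the field appears more than once
        if is_token_field and self.token is None:
            self.token = attr_map.get("value")


def extract_token(html):
    """Return the token value found in `html`, or None if it is absent."""
    parser = TokenParser()
    parser.feed(html)
    return parser.token


# Hypothetical, shortened stand-in for the real landing page
sample = '<form><input name="__RequestVerificationToken" type="hidden" value="abc123" /></form>'
print(extract_token(sample))  # abc123
```

On the real page you would pass the decoded response body (for example response.text) instead of the sample string.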
When the page loads in your browser, this token is used to load the data dynamically (as you already assumed) via JavaScript XMLHttpRequests (XHR).
To view these XHR requests open the Network tab of your browser's developer tools window (shortcut F12):
Both requests fetch the measured data as JSON. For security reasons the web API requires a token, which is sent in a POST request. It is submitted in the body as x-www-form-urlencoded along with the pagination parameters.
See the following example from the command line via cURL:
curl -vi 'https://www.laerm-monitoring.de/zug/train_read' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' --data-raw 'sort=Einfahrtzeit-desc&page=1&pageSize=10&group=&filter=&__RequestVerificationToken=CfDJ8...'
(token was shortened for illustration purposes)
Hint: in the browser's Network tab you can usually right-click on the request and copy it as a cURL command.
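If you prefer Python over cURL, the same form body can be assembled with urllib.parse.urlencode. This is only a sketch: the helper name build_form_body and the placeholder token are assumptions for illustration, and the sort value Einfahrtzeit-desc assumes the field-direction format the site appears to use.

```python
from urllib.parse import urlencode, parse_qs


def build_form_body(token, page=1, page_size=10, sort="Einfahrtzeit-desc"):
    """Return the x-www-form-urlencoded body for one page of results."""
    fields = {
        "sort": sort,
        "page": page,
        "pageSize": page_size,
        "group": "",
        "filter": "",
        "__RequestVerificationToken": token,
    }
    # urlencode percent-escapes the values and joins the pairs with '&'
    return urlencode(fields)


# Placeholder token; the real one comes from the hidden <input> field
body = build_form_body("CfDJ8-placeholder", page=2)
print(body)

# Round-trip check: the body parses back into the same fields
print(parse_qs(body, keep_blank_values=True)["page"])
```

The resulting string is what you would send as the POST body with the Content-Type header shown in the cURL command.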
Upvotes: 3
Reputation: 1134
I have used Selenium to do something similar with Python. Not sure if that works for you. Basically, open the website, right-click on the table and choose Inspect Element. Then go to the div that the table belongs to and right-click it to copy the full XPath. Once you have the XPath, you can scrape the table using Selenium. See this answer.
The only problem is that Selenium actually opens a browser window and doesn't run in the background. I think you can do it silently, but I have never done it.
Another thing is that websites can block you if repeated automated requests come from a single IP. You can use Tor to make each request from a new IP. I have done something like that with Twitter here.
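For the "silent" variant, Selenium can drive Chrome in headless mode. The following is an untested sketch, assuming Selenium 4 and a recent Chrome are installed; the function name fetch_table_rows is made up for illustration, and the XPath you pass in has to come from inspecting the page as described above.

```python
def fetch_table_rows(url, xpath, timeout=10):
    """Open `url` in headless Chrome and return the text of each element matching `xpath`."""
    # Imported inside the function so this module still loads
    # on machines where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    # Assumption: a Chrome version that supports the new headless mode
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one JavaScript-rendered element is present
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, xpath))
        )
        return [element.text for element in driver.find_elements(By.XPATH, xpath)]
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()
```

You would call it with the page URL and the XPath of the table rows you copied from the developer tools; no browser window should appear while it runs.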
Upvotes: 1
Reputation: 195438
The data is loaded dynamically from a different URL. Here is an example of how to load it using just requests/beautifulsoup:
import json
import requests
from bs4 import BeautifulSoup

data = {
    "sort": "Einfahrtzeit-desc",
    "page": "1",
    "pageSize": "10",
    "group": "",
    "filter": "",
    "__RequestVerificationToken": "",
    "locid": "1",
}
headers = {"X-Requested-With": "XMLHttpRequest"}
url = "https://www.laerm-monitoring.de/zug/"
api_url = "https://www.laerm-monitoring.de/zug/train_read"

with requests.Session() as s:
    # load the landing page and read the token from the hidden input field
    soup = BeautifulSoup(s.get(url).content, "html.parser")
    data["__RequestVerificationToken"] = soup.select_one(
        '[name="__RequestVerificationToken"]'
    )["value"]
    # request the table data as JSON
    data = s.post(api_url, data=data, headers=headers).json()

# pretty print the data
print(json.dumps(data, indent=4))
Prints:
{
"Data": [
{
"id": 2536954,
"Einfahrtzeit": "2021-04-24T20:56:26.1703+02:00",
"Gleis": 1,
"Richtung": "Kiel",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 7.3,
"Zugl\u00e4nge": 181.85884,
"Geschwindigkeit": 115.57797,
"Maximalpegel": 88.611084,
"Vorbeifahrtpegel": 85.421326,
"G\u00fcltig": "OK"
},
{
"id": 2536944,
"Einfahrtzeit": "2021-04-24T20:52:25.1703+02:00",
"Gleis": 2,
"Richtung": "Hamburg",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 6.3,
"Zugl\u00e4nge": 211.10226,
"Geschwindigkeit": 152.60104,
"Maximalpegel": 91.81743,
"Vorbeifahrtpegel": 87.95224,
"G\u00fcltig": "OK"
},
{
"id": 2536929,
"Einfahrtzeit": "2021-04-24T20:44:31.4703+02:00",
"Gleis": 1,
"Richtung": "Kiel",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 5.3,
"Zugl\u00e4nge": 104.69964,
"Geschwindigkeit": 110.10052,
"Maximalpegel": 82.100815,
"Vorbeifahrtpegel": 79.98168,
"G\u00fcltig": "OK"
},
{
"id": 2536924,
"Einfahrtzeit": "2021-04-24T20:42:30.3703+02:00",
"Gleis": 1,
"Richtung": "Kiel",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 2.9,
"Zugl\u00e4nge": 49.305683,
"Geschwindigkeit": 125.18,
"Maximalpegel": 98.63289,
"Vorbeifahrtpegel": 97.25019,
"G\u00fcltig": "OK"
},
{
"id": 2536925,
"Einfahrtzeit": "2021-04-24T20:42:20.5703+02:00",
"Gleis": 2,
"Richtung": "Hamburg",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 0.0,
"Zugl\u00e4nge": 0.0,
"Geschwindigkeit": 0.0,
"Maximalpegel": 0.0,
"Vorbeifahrtpegel": 0.0,
"G\u00fcltig": "-"
},
{
"id": 2536911,
"Einfahrtzeit": "2021-04-24T20:35:19.3703+02:00",
"Gleis": 1,
"Richtung": "Kiel",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 4.1,
"Zugl\u00e4nge": 103.97647,
"Geschwindigkeit": 132.2034,
"Maximalpegel": 87.111984,
"Vorbeifahrtpegel": 85.6776,
"G\u00fcltig": "OK"
},
{
"id": 2536907,
"Einfahrtzeit": "2021-04-24T20:33:31.2703+02:00",
"Gleis": 2,
"Richtung": "Hamburg",
"Category": "GZ",
"Zugkategorie": "G\u00fcterzug",
"Vorbeifahrtdauer": 23.8,
"Zugl\u00e4nge": 583.19586,
"Geschwindigkeit": 95.63598,
"Maximalpegel": 88.02967,
"Vorbeifahrtpegel": 85.02115,
"G\u00fcltig": "OK"
},
{
"id": 2536890,
"Einfahrtzeit": "2021-04-24T20:25:36.1703+02:00",
"Gleis": 2,
"Richtung": "Hamburg",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 3.5,
"Zugl\u00e4nge": 104.63446,
"Geschwindigkeit": 160.47487,
"Maximalpegel": 88.60612,
"Vorbeifahrtpegel": 86.46721,
"G\u00fcltig": "OK"
},
{
"id": 2536882,
"Einfahrtzeit": "2021-04-24T20:22:05.8703+02:00",
"Gleis": 2,
"Richtung": "Hamburg",
"Category": "GZ",
"Zugkategorie": "G\u00fcterzug",
"Vorbeifahrtdauer": 26.6,
"Zugl\u00e4nge": 653.52515,
"Geschwindigkeit": 94.59859,
"Maximalpegel": 91.9396,
"Vorbeifahrtpegel": 85.50632,
"G\u00fcltig": "OK"
},
{
"id": 2536869,
"Einfahrtzeit": "2021-04-24T20:16:24.3703+02:00",
"Gleis": 1,
"Richtung": "Kiel",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 3.3,
"Zugl\u00e4nge": 87.8222,
"Geschwindigkeit": 160.01207,
"Maximalpegel": 91.3928,
"Vorbeifahrtpegel": 89.54336,
"G\u00fcltig": "OK"
}
],
"Total": 8657,
"AggregateResults": null,
"Errors": null
}
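The response reports "Total": 8657 while only 10 rows come back per request, so getting everything means walking through the pages. Below is a rough sketch of that loop. The names page_count, iter_rows and fake_fetch are made up for illustration, and fake_fetch only simulates the server; in practice it would be replaced by a function that sends the POST request with the given page and pageSize. It also assumes pages are numbered from 1, as in the request above.

```python
import math


def page_count(total, page_size):
    """Number of pages needed to cover `total` rows."""
    return math.ceil(total / page_size)


def iter_rows(fetch_page, page_size=10):
    """Yield every row, calling fetch_page(page, page_size) once per page.

    fetch_page must return a dict with "Data" (list of rows) and "Total".
    """
    first = fetch_page(1, page_size)
    yield from first["Data"]
    # The first response tells us how many pages remain
    for page in range(2, page_count(first["Total"], page_size) + 1):
        yield from fetch_page(page, page_size)["Data"]


def fake_fetch(page, page_size):
    """Stand-in for the real POST request: serves 25 dummy rows."""
    rows = [{"id": i} for i in range(25)]
    start = (page - 1) * page_size
    return {"Data": rows[start:start + page_size], "Total": len(rows)}


print(page_count(8657, 10))              # 866
print(len(list(iter_rows(fake_fetch))))  # 25
```

With 866 requests for the full data set, it would be polite to add a short pause between pages or to try a larger pageSize first.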
Upvotes: 3