Reputation: 395
I am trying to scrape a list from the following URL: https://www.oncomap.de/centers?selectedOrgan=Darm&selectedCounty=Deutschland
Using Chrome's Developer Tools, I find that my content of interest is inside body > app-root > app-top > div ... I tried finding this content using Python's BeautifulSoup4 package. Unfortunately, it is not possible to descend into the structure beyond the app-root tag. I am using the following code:
import pprint

import requests
from bs4 import BeautifulSoup

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

url = 'https://www.oncomap.de/centers?selectedOrgan=Darm&selectedCounty=Deutschland'
# Pass headers by keyword: the second positional argument of requests.get is params.
req = requests.get(url, headers=headers)
# The built-in parser is named "html.parser".
soup = BeautifulSoup(req.content, "html.parser")
mat_row = soup.select('body > app-root')

pp = pprint.PrettyPrinter()
for child in mat_row[0].descendants:
    pp.pprint(child)
There is no output from this code: no descendant is printed (I also tried children). I think I am dealing with a ReactJS div here. Does anyone have hints on how to process such content? Specifically, I would like to scrape the main list on the page into a Python-readable table. Thanks for your help!
Upvotes: 3
Views: 1523
Reputation: 195418
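The data is loaded dynamically via JavaScript, so it is not in the HTML that requests receives. However, you can use the requests module to load the data directly from the site's backend API, which returns JSON: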
import json
import requests
from bs4 import BeautifulSoup
clinics_url = 'https://back.oncomap.de/api/direct/fulldb_clinics'
centers_url = 'https://back.oncomap.de/api/direct/fulldb_centers'
data1 = requests.get(clinics_url).json()
data2 = requests.get(centers_url).json()
clinics = {d['clinic_nr']:d for d in data1}
# uncomment this to print all data:
# print(json.dumps(data1, indent=4))
# print(json.dumps(data2, indent=4))
for c in data2:
    print(c['reg_nr'], c['inst1'], clinics.get(c['clinic_nr'], {}).get('inst1', '-'), c['url'], sep='\t')
Prints:
AB-Z001 G Brustzentrum Stuttgart am Marienhospital Marienhospital Stuttgart https://www.marienhospital-stuttgart.de/interdisziplinaere-zentren/brustzentrum/
FAB-Z007-1 G Universitäts-Brustzentrum Tübingen Universitätsklinikum Tübingen, CCC Tübingen-Stuttgart www.uni-frauenklinik-tuebingen.de/brustzentrum.html
FAB-Z010 G Interdisziplinäres Brustkrebszentrum der Charité (IBZ) im Charité Comprehensive Cancer Center Charité - Campus Mitte https://cccc.charite.de/leistungen/organbereiche/brustkrebs/
FAB-Z012-1 G Kooperatives Brustzentrum Klinikum Region Hannover KRH Klinikum Siloah www.krh.eu/klinikum/SOH/zentren/brustzentrum
FAB-Z016 G Brustzentrum Robert-Bosch-Krankenhaus Robert-Bosch-Krankenhaus; Klinik Schillerhöhe http://www.rbk.de/disziplinen/interdisziplinaere-zentren/brustzentrum.html
FAB-Z017 G Brustzentrum Halle des Universitätsklinikums Halle (Saale) Universitäts-Klinikum Halle-Saale www.unifrauenklinik-halle.de
FAB-Z020 G Brustzentrum im Sana Klinikum Lichtenberg Sana Klinikum Lichtenberg http://www.sana-kl.de/unser-leistungsspektrum/kliniken-institute/brustzentrum-des-sana-klinikum-lichtenberg.html
FAB-Z021 G Interdisziplinäres Brustzentrum der ALB FILS KLINIKEN Klinik am Eichert Göppingen www.alb-fils-kliniken.de
FAB-Z022 Kooperatives Brustzentrum Landshut Klinikum Landshut www.klinikum-landshut.de
FAB-Z023 G Brustzentrum Saar Mitte CaritasKlinikum Saarbrücken St. Theresia www.caritasklinik.de
FAB-Z024 G Brustzentrum am Universitätsklinikum Hamburg-Eppendorf Universitätsklinikum Hamburg-Eppendorf www.uke.de/kliniken-institute/zentren/brustzentrum/index.html
FAB-Z025-1 Südthüringer Brustzentrum Suhl / Meiningen SRH Zentralklinikum Suhl www.srh.de
FAB-Z026 G Brustzentrum Klinikum Oldenburg Klinikum Oldenburg www.klinikum-oldenburg.de
...and so on.
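Since the question asks for a Python-readable table, the same join can be collected into rows instead of printed. This is only a sketch: the field names (reg_nr, inst1, clinic_nr, url) are the ones used in the code above, while build_center_table, rows_to_csv and the sample records are made up for illustration and are not real API responses.

```python
import csv
import io

COLUMNS = ["reg_nr", "center", "clinic", "url"]


def build_center_table(centers, clinics):
    """Join each center with its clinic (via clinic_nr) into flat dict rows."""
    clinics_by_nr = {c["clinic_nr"]: c for c in clinics}
    rows = []
    for center in centers:
        clinic = clinics_by_nr.get(center.get("clinic_nr"), {})
        rows.append({
            "reg_nr": center.get("reg_nr", ""),
            "center": center.get("inst1", ""),
            "clinic": clinic.get("inst1", "-"),  # "-" when no clinic matches
            "url": center.get("url", ""),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample records standing in for the two JSON lists.
sample_clinics = [{"clinic_nr": 1, "inst1": "Example Clinic A"}]
sample_centers = [
    {"reg_nr": "X-001", "inst1": "Example Center 1", "clinic_nr": 1,
     "url": "https://example.org/1"},
    {"reg_nr": "X-002", "inst1": "Example Center 2", "clinic_nr": 99,
     "url": "https://example.org/2"},
]

table = build_center_table(sample_centers, sample_clinics)
csv_text = rows_to_csv(table)
```

With the real data, build_center_table(data2, data1) should give one row per center. A list of dicts like this can also be passed to pandas.DataFrame if pandas is available.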
Upvotes: 1
Reputation: 1407
Since the page is dynamically loaded, you won't get the rendered HTML by just fetching it with the requests package.
What you can do instead is scrape with a headless browser and make it wait until a specific element has appeared on the page.
Here is a tutorial on web scraping with Selenium (a package for driving browsers, including headless ones): https://www.scrapingbee.com/blog/selenium-python/
That tutorial also has a section titled "waiting for an element to be present", which looks like what you need.
There is also a related Stack Overflow question: Wait until page is loaded with selenium webdriver
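To make the "wait until an element has appeared" idea concrete, here is a minimal polling sketch. Selenium ships its own WebDriverWait helper for this, which is what you would normally use. The hand-rolled wait_for_element below is a hypothetical illustration of the mechanism, and FakeDriver is a made-up stand-in so the snippet runs without a browser. It assumes only that the driver has a find_elements(by, value) method returning a list, as Selenium drivers do.

```python
import time

# Locator strategy name; this is the string value of Selenium's By.CSS_SELECTOR.
CSS_SELECTOR = "css selector"


def wait_for_element(driver, css_selector, timeout=20.0, poll_interval=0.5):
    """Poll the driver until an element matches css_selector or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        elements = driver.find_elements(CSS_SELECTOR, css_selector)
        if elements:
            return elements[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"no element matched {css_selector!r} within {timeout} s"
            )
        time.sleep(poll_interval)


class FakeDriver:
    """Stand-in for a browser driver: the element 'appears' after a few polls."""

    def __init__(self, polls_until_ready=3):
        self.polls = 0
        self.polls_until_ready = polls_until_ready

    def find_elements(self, by, value):
        self.polls += 1
        return ["<app-top>"] if self.polls >= self.polls_until_ready else []


fake = FakeDriver()
found = wait_for_element(fake, "app-root app-top", timeout=2.0, poll_interval=0.01)
```

With a real Selenium driver in place of FakeDriver, you would call driver.get(url) first, wait as above, and then hand driver.page_source to BeautifulSoup. The selector app-root app-top is taken from the element path in the question.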
Upvotes: 1