Reputation: 93
I am trying to fetch a specific group of li elements nested in a ul. Below is my starting code. The data I am trying to fetch is at https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Risikogebiete_neu.html. I highlighted the block of li elements that I want to fetch.
> import requests
> from bs4 import BeautifulSoup
>
> page = requests.get('https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Risikogebiete_neu.html').text
> soup = BeautifulSoup(page, 'html.parser')
> # print(soup.prettify())
>
> uls = soup.find_all('ul', id=None)
> mine = []
> for ul in uls:
>     newsoup = BeautifulSoup(str(ul), 'html.parser')
>     lis = newsoup.find_all('li', id=None)
>     for li in lis:
>         mine.append(li.text)
>         print(li.text)
Upvotes: 0
Views: 682
Reputation: 93
Below is the code that I ended up using to accomplish my task. I am sure it can be further improved.
import requests, re
import pandas as pd
from bs4 import BeautifulSoup

# print(soup.prettify())
# print("Loading...")
page = requests.get('https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Risikogebiete_neu.html').text
soup = BeautifulSoup(page, 'html.parser')

countries = []
result = re.compile(r'Ebenso wird berücksichtigt, wenn keine verlässlichen Informationen für bestimmte Staaten vorliegen.', re.I)
lis = soup.find_all(text=result)[-1].findNext('ul').find_all('li')
for value in lis:
    if re.match(r"^<ul><li>.+", str(value)):
        pass
    elif re.match(r"^<li><p>.+", str(value)):
        pass
    else:
        countries.append(re.findall(r"(?<=\>)(.*?)(?=\()", str(value)))

flattened = [val for sublist in countries for val in sublist]
df = pd.DataFrame(flattened, columns=['No_Go_Country'])
df.to_excel(r'C:\Users\Anodaram\Desktop\no_go_countries.xlsx', sheet_name='No_Go_Countries', index=None)
Upvotes: 0
Reputation: 57175
There are many ways to do this, depending on your use case and your expectations about the structure: whether it's a one-time scrape, or whether you anticipate that the text or markup will change.
One option is to pick the element that immediately precedes the element with the sectionRelated class:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> page = requests.get("https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Risikogebiete_neu.html").text
>>> soup = BeautifulSoup(page, "html.parser")
>>> lis = soup.select_one(".sectionRelated").previous_sibling.previous_sibling.select("li")
>>> [x.text[:20] for x in lis]
['Rumänien: Gebiete („', 'Belgien: Provinz Ant', 'Bulgarien: Oblast Do']
Upvotes: 1
Reputation: 545943
This works:
token = 'Gebiete, die zu einem beliebigen Zeitpunkt in den vergangenen 14 Tagen Risikogebiete waren, aber derzeit KEINE mehr sind:'
no_longer_at_risk = soup.find_all(text=token)[0].findNext('ul').find_all('li')
This requires that the text we’re searching for doesn’t change — even just slightly! You could make it more robust by searching for a regular expression instead.
import re
token = re.compile(r'vergangen.*Risikogebiet.*keine.*mehr', re.I)
no_longer_at_risk = soup.find_all(text=token)[-1].findNext('ul').find_all('li')
But fundamentally the best way would probably be to iterate over all text nodes in the document and check which one matches the most tokens from a list (e.g. ['Gebiet', 'Risikogebiet', 'vergangen', 'kein', 'mehr']).
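A rough sketch of that token-scoring idea, using a hypothetical miniature document in place of the real RKI page (the tokens, helper name score, and sample HTML are all illustrative, not from the original answer):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the RKI page.
html = """
<p>Aktuelle Risikogebiete sind:</p>
<p>Gebiete, die in den vergangenen 14 Tagen Risikogebiete waren,
aber derzeit keine mehr sind:</p>
<ul><li>Rumänien (Beispiel)</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

tokens = ['Gebiet', 'Risikogebiet', 'vergangen', 'kein', 'mehr']

def score(text):
    # Count how many of the tokens occur in this text node.
    return sum(bool(re.search(t, text, re.I)) for t in tokens)

# Pick the text node that matches the most tokens, then grab the
# following ul, as in the snippets above.
best = max(soup.find_all(text=True), key=score)
lis = best.findNext('ul').find_all('li')
print([li.text for li in lis])  # ['Rumänien (Beispiel)']
```

This tolerates small wording changes on the page, since the heading only needs to contain most of the tokens rather than match an exact string or regex.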
Upvotes: 1
Reputation: 83557
One way to do this is with XPath. It allows you to select a specific element in the document by specifying the entire nesting from the top level. Note that this is very brittle, because it will break if any nesting changes.
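A minimal sketch of the XPath approach using lxml (BeautifulSoup itself does not support XPath). The HTML snippet and the absolute path are hypothetical; against the real page you would have to spell out its actual nesting:

```python
from lxml import html

# Hypothetical snippet; the real page's nesting differs.
page = """
<html><body>
  <div id="main">
    <div class="text">
      <ul><li>Rumänien</li><li>Belgien</li></ul>
    </div>
  </div>
</body></html>
"""
tree = html.fromstring(page)

# An absolute XPath pins down one ul by its full nesting from the
# root; any change to that nesting breaks the expression.
lis = tree.xpath('/html/body/div[@id="main"]/div[@class="text"]/ul/li')
print([li.text for li in lis])  # ['Rumänien', 'Belgien']
```

Browser dev tools can generate such a full path for you (right-click an element, "Copy XPath"), which is convenient for a one-time scrape.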
Upvotes: 1