PythonLearner

Reputation: 93

Scraping a specific group of <li> nested in <ul>

I am trying to fetch a specific group of li nested in ul. Below is my starting code. The data I am trying to fetch is at https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Risikogebiete_neu.html. I highlighted the block of li(s) that I wanted to fetch.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Risikogebiete_neu.html').text
soup = BeautifulSoup(page, 'html.parser')

uls = soup.find_all('ul', id=None)
mine = []
for ul in uls:
    lis = ul.find_all('li', id=None)
    for li in lis:
        mine.append(li.text)
        print(li.text)


Upvotes: 0

Views: 682

Answers (4)

PythonLearner

Reputation: 93

Below is the code that I ended up using to accomplish my task. I am sure it can be further improved.

import requests, re
import pandas as pd
from bs4 import BeautifulSoup

page = requests.get('https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Risikogebiete_neu.html').text
soup = BeautifulSoup(page, 'html.parser')

countries = []
# Anchor on the sentence that directly precedes the list of countries.
result = re.compile(r'Ebenso wird berücksichtigt, wenn keine verlässlichen Informationen für bestimmte Staaten vorliegen.', re.I)
lis = soup.find_all(text=result)[-1].findNext('ul').find_all('li')
for value in lis:
    if re.match(r"^<ul><li>.+", str(value)):
        pass  # skip entries whose markup matches this pattern
    elif re.match(r"^<li><p>.+", str(value)):
        pass  # skip entries that only wrap a paragraph
    else:
        # keep the text between the closing bracket of the tag and the first "("
        countries.append(re.findall(r"(?<=\>)(.*?)(?=\()", str(value)))

flattened = [val for sublist in countries for val in sublist]
df = pd.DataFrame(flattened, columns=['No_Go_Country'])
df.to_excel(r'C:\Users\Anodaram\Desktop\no_go_countries.xlsx', sheet_name='No_Go_Countries', index=None)

Upvotes: 0

ggorlen

Reputation: 57175

There are many ways to do this, depending on your use case and your expectations for the structure: whether it's a one-time scrape, or you anticipate the text or markup will change.

One option is to pick the element that immediately precedes the sectionRelated class:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> page = requests.get("https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Risikogebiete_neu.html").text
>>> soup = BeautifulSoup(page, "html.parser")
>>> lis = soup.select_one(".sectionRelated").previous_sibling.previous_sibling.select("li")
>>> [x.text[:20] for x in lis]
['Rumänien: Gebiete („', 'Belgien: Provinz Ant', 'Bulgarien: Oblast Do']

Upvotes: 1

Konrad Rudolph

Reputation: 545943

This works:

token = 'Gebiete, die zu einem beliebigen Zeitpunkt in den vergangenen 14 Tagen Risikogebiete waren, aber derzeit KEINE mehr sind:'

no_longer_at_risk = soup.find_all(text=token)[0].findNext('ul').find_all('li')

This requires that the text we’re searching for doesn’t change — even just slightly! You could make it more robust by searching for a regular expression instead.

import re

token = re.compile(r'vergangen.*Risikogebiet.*keine.*mehr', re.I)
no_longer_at_risk = soup.find_all(text=token)[-1].findNext('ul').find_all('li')

But fundamentally the best approach would probably be to iterate over all text nodes in the document and pick the one that matches the most tokens from a list (e.g. ['Gebiet', 'Risikogebiet', 'vergangen', 'kein', 'mehr']).
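A minimal sketch of that token-scoring idea, run against a small, made-up HTML snippet (the token list and markup here are illustrative, not the real RKI page):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<p>Allgemeine Hinweise zur Einreise.</p>
<p>Gebiete, die in den vergangenen 14 Tagen Risikogebiete waren,
aber derzeit keine mehr sind:</p>
<ul><li>Land A (Region X)</li><li>Land B (Region Y)</li></ul>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

# Tokens are illustrative; tune them to the page you are scraping.
tokens = ['gebiet', 'risikogebiet', 'vergangen', 'kein', 'mehr']

def score(text):
    """Count how many tokens occur in the (lowercased) text node."""
    lowered = text.lower()
    return sum(1 for t in tokens if t in lowered)

# Pick the text node that matches the most tokens, then take the
# <ul> that follows it in document order.
best = max(soup.find_all(text=True), key=score)
lis = best.findNext('ul').find_all('li')
print([li.text for li in lis])
```

This survives small wording changes (singular/plural, punctuation) as long as enough of the tokens still appear in the anchor sentence.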

Upvotes: 1

Code-Apprentice

Reputation: 83557

One way to do this is with XPath. This allows you to select a specific element by specifying the entire nesting from the top level of the document. Note that this is very brittle because it will break if any of the nesting changes.
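For example, using lxml (not shown elsewhere in this thread) on a simplified, hypothetical snippet of markup, an absolute XPath might look like this; the real RKI page's nesting will differ:

```python
from lxml import html as lhtml

# Hypothetical markup standing in for the real page.
doc = lhtml.fromstring("""
<html><body>
  <div id="main">
    <p>Derzeitige Risikogebiete:</p>
    <ul>
      <li>Land A (Region X)</li>
      <li>Land B (Region Y)</li>
    </ul>
  </div>
</body></html>
""")

# Absolute path from the root down to the list items;
# any change to the nesting breaks this selector.
items = doc.xpath('/html/body/div/ul/li')
print([li.text_content().strip() for li in items])
```

A relative expression such as `//ul/li` would be less brittle, at the cost of possibly matching unrelated lists.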

Upvotes: 1
