Reputation: 57
I am trying to extract all "Places of Interest" from Wikipedia pages using Beautiful Soup and Python/pandas, and put them into a DataFrame. For example:
https://en.wikipedia.org/wiki/1st_arrondissement_of_Paris
import requests
from bs4 import BeautifulSoup

url_Paris_01 = requests.get('https://en.wikipedia.org/wiki/1st_arrondissement_of_Paris').text
soup_Paris_01 = BeautifulSoup(url_Paris_01, "html.parser")
for headline in soup_Paris_01.find_all("span", {"class": "mw-headline"}):
    print(headline.text)
Geography
Demography
Historical population
Immigration
Quarters
Economy
Education
Map
Cityscape
Places of interest
Bridges
Streets and squares
See also
References
External links
This does not work:
soup_Paris_01.find_all('li',attrs={"id":"Places_of_interest"})
I see that my "Places of Interest" links all have a title attribute.
Places of interest
Upvotes: 0
Views: 52
Reputation: 33384
First find the ul element just after the "Places of interest" span tag, then call find_all() to get every anchor tag under that ul.
from bs4 import BeautifulSoup
import requests

url_Paris_01 = requests.get('https://en.wikipedia.org/wiki/1st_arrondissement_of_Paris').text
soup_Paris_01 = BeautifulSoup(url_Paris_01, "html.parser")

# Locate the "Places of interest" heading span, then the first <ul> that follows it
place_of_interest = soup_Paris_01.find("span", id="Places_of_interest").find_next('ul')
for place in place_of_interest.find_all('a'):
    print(place.get('title'))  # the link's title attribute (None if the anchor has none)
    print(place.text)          # the link's visible text
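Since the question mentions pandas, here is a sketch of collecting those title/text pairs into a DataFrame. It uses a small inline HTML snippet that mimics the Wikipedia section structure (the snippet and its example entries are illustrative, not taken from the live page), so the same pattern applies to the soup fetched above:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Minimal HTML mimicking the Wikipedia section layout (illustrative only)
html = """
<h2><span class="mw-headline" id="Places_of_interest">Places of interest</span></h2>
<ul>
  <li><a href="/wiki/Louvre" title="Louvre">Louvre</a></li>
  <li><a href="/wiki/Tuileries_Garden" title="Tuileries Garden">Tuileries Garden</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
places = soup.find("span", id="Places_of_interest").find_next("ul")

# One row per anchor; .get("title") returns None instead of raising if the attribute is missing
rows = [{"title": a.get("title"), "text": a.text} for a in places.find_all("a")]
df = pd.DataFrame(rows)
print(df)
```

Using .get("title") rather than a["title"] keeps the loop from raising a KeyError on anchors without a title attribute.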
Upvotes: 2