Reputation: 53
I wrote a script to capture the independence dates of a few countries from Wikipedia.
For example, for Kazakhstan:
import re
import requests
from bs4 import BeautifulSoup

URL_QS = 'https://en.wikipedia.org/wiki/Kazakhstan'
r = requests.get(URL_QS)
soup = BeautifulSoup(r.text, 'lxml')
independ_date = None
# Only keep the infobox (top right)
infobox = soup.find("table", class_="infobox geography vcard")
if infobox:
    formation = infobox.find_next(text = re.compile("Formation"))
    if formation:
        # Try the lowercase spelling first, then the capitalised one
        independence = formation.find_next(text = re.compile("independence"))
        if independence:
            independ_date = independence.find_next("td").text
        else:
            independence = formation.find_next(text = re.compile("Independence"))
            if independence:
                independ_date = independence.find_next("td").text
print(independ_date)
I get the following output:
Almaty
This value does not come from the infobox but from the text after it. That is because "formation.find_next(text = re.compile("independence"))" found something outside the infobox, but I don't understand why the search is not limited to the infobox. How can I restrict the search to the infobox only?
Thank you in advance for your help!
Upvotes: 5
Views: 289
Reputation: 918
Your code reads the value after the first "independence" match, when it should use the second. Also, the "Formation" string does not generalize well: I tested it on some countries. I therefore suggest searching for "Independence" from the start:
infobox = soup.find("table", class_="infobox geography vcard")
if infobox:
    formation = infobox.find_next(text = re.compile("Independence"))
    if formation:
        independence = formation.find_next(text = re.compile("independence"))
        if independence:
            independence = infobox.find_next(text = re.compile("Independence"))
            independ_date = independence.find_next("td").text
            print(independ_date)
This should return the first date in the independence section of the Wikipedia page for countries that list an independence date.
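One way to drop the separate "independence" / "Independence" branches and keep the search inside the table is to call find() on the infobox itself, which only looks at its descendants, with a case-insensitive pattern. This is a different technique from the code above, not a correction of it. It is a minimal sketch: the HTML string and the independence_date helper are made up for illustration and only imitate the rough shape of an infobox, so real pages may need extra handling.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for a country page, not real Wikipedia markup.
HTML = """<table class="infobox geography vcard">
<tr><th>Formation</th><td></td></tr>
<tr><th>Independence from the Soviet Union</th><td>16 December 1991</td></tr>
</table>
<p>Text about independence outside the infobox.</p>
<table><tr><td>Almaty</td></tr></table>"""


def independence_date(html):
    """Return the text of the first <td> after an 'independence' label
    inside the infobox, or None if nothing suitable is found."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox geography vcard")
    if infobox is None:
        return None
    # find() searches only the descendants of infobox; re.I matches
    # both "independence" and "Independence".
    label = infobox.find(string=re.compile("independence", re.I))
    if label is None:
        return None
    # find_next() can still run past the table, so check the cell's ancestors.
    cell = label.find_next("td")
    if cell is None or not any(p is infobox for p in cell.parents):
        return None
    return cell.get_text(strip=True)


print(independence_date(HTML))
```

The ancestor check is there because find_next() follows document order and does not stop at the end of the table on its own.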
Upvotes: 0
Reputation: 971
It's because "formation.find_next(text = re.compile("independence"))" found something outside of the infobox
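Add .extract() to your soup.find() call to search only inside the "infobox geography vcard" element. extract() detaches the table from the rest of the document, so find_next() can no longer continue past the end of the table: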
infobox = soup.find("table", class_="infobox geography vcard").extract()
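To see the effect of .extract() in isolation, here is a small sketch that runs the same search with and without it. The HTML string and the first_cell_after helper are hypothetical, written only for this comparison, and "html.parser" is used instead of "lxml" just to avoid the extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature page: an infobox followed by unrelated content.
HTML = """<html><body>
<table class="infobox geography vcard">
<tr><th>Formation</th><td>placeholder</td></tr>
<tr><th>Independence from the Soviet Union</th><td>16 December 1991</td></tr>
</table>
<p>After independence, the capital was moved.</p>
<table><tr><td>Almaty</td></tr></table>
</body></html>"""


def first_cell_after(start, pattern):
    """Text of the first <td> after the first string matching pattern, or None."""
    hit = start.find_next(string=re.compile(pattern))
    if hit is None:
        return None
    cell = hit.find_next("td")
    return cell.get_text(strip=True) if cell else None


# Without extract(): find_next() walks on into the rest of the document.
soup = BeautifulSoup(HTML, "html.parser")
infobox = soup.find("table", class_="infobox geography vcard")
leaked = first_cell_after(infobox, "independence")

# With extract(): the infobox is detached, so the search ends with the table.
soup = BeautifulSoup(HTML, "html.parser")
infobox = soup.find("table", class_="infobox geography vcard").extract()
contained = first_cell_after(infobox, "independence")
inside = first_cell_after(infobox, "Independence")

print(leaked, contained, inside)
```

Note that soup.find() returns None when the page has no such table, in which case calling .extract() on the result raises AttributeError, so a None check before extracting is safer on real pages.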
Upvotes: 1