jGsch

Reputation: 53

Python & Beautiful Soup: Searching only in a certain class

I am writing a script to capture the independence date of a few countries from Wikipedia.

For example, with Kazakhstan:

import re

import requests
from bs4 import BeautifulSoup

URL_QS = 'https://en.wikipedia.org/wiki/Kazakhstan'
r = requests.get(URL_QS)
soup = BeautifulSoup(r.text, 'lxml')

# Only keep the infobox (top right)
infobox = soup.find("table", class_="infobox geography vcard")

if infobox:
    formation = infobox.find_next(text = re.compile("Formation"))

    if formation: 
        independence = formation.find_next(text = re.compile("independence")) 

        if independence:
            independ_date = independence.find_next("td").text
        else:
            independence = formation.find_next(text = re.compile("Independence"))

            if independence:
                independ_date = independence.find_next("td").text


print(independ_date)

And I have the following output:

Almaty

This value does not come from the infobox but from the article text after it. That is because "formation.find_next(text = re.compile("independence"))" matched something outside the infobox, but I don't understand why the search is not restricted to the infobox. How can I search only within that element?

Thank you in advance for your help!

Upvotes: 5

Views: 289

Answers (2)

O.Suleiman

Reputation: 918

Your code takes the value after the first match of "independence", but the one you want is the second. Also, the "Formation" string does not generalize well to the other countries I tested, so I think you can search for "Independence" from the start:

infobox = soup.find("table", class_="infobox geography vcard")

if infobox:
    formation = infobox.find_next(text = re.compile("Independence"))

    if formation: 
        independence = formation.find_next(text = re.compile("independence")) 

        if independence:
            independence = infobox.find_next(text = re.compile("Independence"))
            independ_date = independence.find_next("td").text

print(independ_date)

This will return the first date in the independence section of the Wikipedia page for any country that has an independence date.
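As a complement, here is a minimal sketch of a search that cannot leave the infobox, because find() only looks at descendants of the element it is called on, while find_next() follows document order. The SAMPLE_HTML string and the independence_date helper are hypothetical stand-ins for illustration; the real Wikipedia markup is more complex and the row layout assumed here (label and value in the same "tr") may not hold for every country.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for a Wikipedia country page.
SAMPLE_HTML = """
<html><body>
<table class="infobox geography vcard">
  <tr><th>Capital</th><td>Astana</td></tr>
  <tr><th>Formation</th><td></td></tr>
  <tr><th>Declared independence from the Soviet Union</th><td>16 December 1991</td></tr>
</table>
<p>After independence the capital was moved.</p>
<table><tr><td>Almaty</td></tr></table>
</body></html>
"""


def independence_date(html):
    """Return the infobox value next to an 'independence' label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return None
    # find() searches only inside infobox; IGNORECASE covers both
    # "independence" and "Independence" in a single pass.
    label = infobox.find(string=re.compile("independence", re.IGNORECASE))
    if label is None:
        return None
    # Assumption: the value sits in the <td> of the same table row.
    row = label.find_parent("tr")
    value = row.find("td") if row is not None else None
    return value.get_text(strip=True) if value is not None else None


print(independence_date(SAMPLE_HTML))
```

Returning None when nothing matches also avoids printing a variable that was never assigned when a page has no independence row.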

Upvotes: 0

Nik Markin

Reputation: 971

It's because "formation.find_next(text = re.compile("independence"))" found something outside of the infobox

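Add .extract() to your soup.find() call to search only inside the "infobox geography vcard" element. extract() detaches the table from the rest of the page, so find_next() can no longer run past the end of the infobox.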

infobox = soup.find("table", class_="infobox geography vcard").extract()
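The difference can be checked on a small example. This is only a sketch on a hypothetical two-element page (the HTML string is made up), comparing find_next() on an attached table, find() on the same table, and find_next() after extract(). It uses the string= argument, which is the current name for text= in Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: the word "independence" appears only outside the infobox.
HTML = """
<table class="infobox"><tr><th>Formation</th><td>1991</td></tr></table>
<p>Text about independence outside the infobox.</p>
"""

pattern = re.compile("independence")

# Case 1: table still attached to the page.
soup = BeautifulSoup(HTML, "html.parser")
infobox = soup.find("table", class_="infobox")
escaped = infobox.find_next(string=pattern)  # walks on into the <p>
scoped = infobox.find(string=pattern)        # descendants only

# Case 2: table detached from the page with extract().
soup2 = BeautifulSoup(HTML, "html.parser")
detached = soup2.find("table", class_="infobox").extract()
after_extract = detached.find_next(string=pattern)  # nothing after the table

print(repr(escaped))
print(scoped)
print(after_extract)
```

Note that extract() modifies the parsed tree, so the table is no longer part of the soup afterwards; calling find() on the infobox gives the same scoping without changing the document.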

Upvotes: 1
