Hamuel

Reputation: 33

How to scrape td corresponding to header text in Beautifulsoup

I am trying to scrape Wikipedia using Beautiful Soup. I want to get the text inside a <td>, but only for the row with a certain header text.

For example: I want to get the list of awards Alan Turing has received from https://en.wikipedia.org/wiki/Alan_Turing

The information I need is in the right table, in the table data corresponding to the table header with text Awards. How can I get the list of awards?

I have tried looping through the table rows and checking if table header is equal to 'Awards' but I don't know how to stop the loop in case there is no 'Awards' header in the table.

testurl = "https://en.wikipedia.org/wiki/Alan_Turing"
page = requests.get(testurl)
page_content = BeautifulSoup(page.content, "html.parser")
table = page_content.find('table' ,attrs={'class':'infobox biography vcard'})
while True:
    tr = table.find('tr')
    if tr.find('th').renderContents() == 'Awards':
        td = tr.find('td')
        break
print(td)

Upvotes: 0

Views: 801

Answers (2)

Andrej Kesely

Reputation: 195438

You can use the CSS selector th:contains("Awards"), which selects the <th> tag that contains the text Awards.

Then + td a[title] selects the <td> that immediately follows it and every <a> tag inside that has a title= attribute:

import requests
from bs4 import BeautifulSoup


url = 'https://en.wikipedia.org/wiki/Alan_Turing'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

awards = [a.text for a in soup.select('th:contains("Awards") + td a[title]')]
print(awards)

Prints:

["Smith's Prize"]

For url = 'https://en.wikipedia.org/wiki/Albert_Einstein' it will print:

['Barnard Medal', 'Nobel Prize in Physics', 'Matteucci Medal', 'ForMemRS', 'Copley Medal', 'Gold Medal of the Royal Astronomical Society', 'Max Planck Medal', 'Member of the National Academy of Sciences', 'Time Person of the Century']

Update 2021/10/31

beautifulsoup4 version 4.10.0

th:contains is now deprecated; use th:-soup-contains instead.

Example:

awards = [a.text for a in soup.select('th:-soup-contains("Awards") + td a[title]')]
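To check the selector without hitting Wikipedia, here is a minimal offline sketch. The markup in SAMPLE_HTML and the helper name awards_from_html are made up for illustration, and it assumes a Beautiful Soup install whose bundled soupsieve supports :-soup-contains. When no <th> contains Awards, select() just returns an empty list, so there is no loop to stop.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox markup (an assumption, not copied from Wikipedia).
SAMPLE_HTML = """
<table class="infobox">
  <tr><th>Born</th><td>23 June 1912</td></tr>
  <tr><th>Awards</th><td><a href="/wiki/Smith%27s_Prize" title="Smith's Prize">Smith's Prize</a> (1936)</td></tr>
</table>
"""


def awards_from_html(html):
    """Return link texts from the <td> right after a <th> containing 'Awards'."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select('th:-soup-contains("Awards") + td a[title]')
    return [a.get_text(strip=True) for a in links]


with_awards = awards_from_html(SAMPLE_HTML)
# A table without an 'Awards' header yields an empty list instead of an error.
without_awards = awards_from_html("<table><tr><th>Born</th><td>1912</td></tr></table>")

print(with_awards)
print(without_awards)
```

An empty result can then be tested with a plain `if not awards:` before using the list.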

Upvotes: 2

Maxiboi

Reputation: 160

Here's how you can access the 'Awards' part. Hope this is helpful to you

from bs4 import BeautifulSoup
import urllib.request

testurl = "https://en.wikipedia.org/wiki/Alan_Turing"
page = urllib.request.urlopen(testurl)
page_content = BeautifulSoup(page, "html.parser")
table = page_content.find('table' ,attrs={'class':'infobox biography vcard'})

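your_needed_variable = None  # stays None if the infobox has no 'Awards' header
for th in table.find_all('th'):
    if th.get_text(strip=True) == 'Awards':
        # the awards are in the <td> that follows the matching <th>
        your_needed_variable = th.find_next_sibling('td').get_text(' ', strip=True)
        break

print(your_needed_variable)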
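The loop-based approach can also be packaged as a small helper that returns None when the header is missing, which covers the "no 'Awards' header" case from the question. This is only a sketch: find_cell_for_header and the SAMPLE_HTML markup are hypothetical names and data, not part of Beautiful Soup or Wikipedia. Iterating over find_all('tr') visits each row once, so the loop always ends.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox markup (an assumption for illustration).
SAMPLE_HTML = """
<table class="infobox">
  <tr><th>Born</th><td>23 June 1912</td></tr>
  <tr><th>Awards</th><td>Smith's Prize (1936)</td></tr>
</table>
"""


def find_cell_for_header(table, header_text):
    """Return the <td> in the row whose <th> text equals header_text, or None."""
    for row in table.find_all("tr"):  # each row is visited once, so the loop terminates
        th = row.find("th")
        if th is not None and th.get_text(strip=True) == header_text:
            return row.find("td")
    return None  # no row had a matching header


table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")

awards_cell = find_cell_for_header(table, "Awards")
missing_cell = find_cell_for_header(table, "Spouse")

awards_text = awards_cell.get_text(strip=True) if awards_cell is not None else None
print(awards_text)
print(missing_cell)
```

Returning None rather than raising keeps the caller's check down to a single `if cell is not None:`.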

Upvotes: 0
