Reputation: 33
I am trying to scrape Wikipedia using Beautiful Soup. I want to get the text inside a <td>, but only the contents of the row with a certain header text.
For example: I want to get the list of awards Alan Turing has received from https://en.wikipedia.org/wiki/Alan_Turing
The information I need is in the table on the right, in the table data cell corresponding to the table header with the text Awards. How can I get the list of awards?
I have tried looping through the table rows and checking whether the table header is equal to 'Awards', but I don't know how to stop the loop if there is no 'Awards' header in the table.
import requests
from bs4 import BeautifulSoup
testurl = "https://en.wikipedia.org/wiki/Alan_Turing"
page = requests.get(testurl)
page_content = BeautifulSoup(page.content, "html.parser")
table = page_content.find('table', attrs={'class': 'infobox biography vcard'})
while True:
    tr = table.find('tr')
    if tr.find('th').renderContents() == 'Awards':
        td = tr.find('td')
        break
print(td)
Upvotes: 0
Views: 801
Reputation: 195438
You can use the CSS selector th:contains("Awards"), which selects the <th> tag that contains the text Awards. Then + td a[title] selects the next sibling <td> and every <a> tag inside it that has a title= attribute:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/Alan_Turing'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
awards = [a.text for a in soup.select('th:contains("Awards") + td a[title]')]
print(awards)
Prints:
["Smith's Prize"]
For url = 'https://en.wikipedia.org/wiki/Albert_Einstein' it will print:
['Barnard Medal', 'Nobel Prize in Physics', 'Matteucci Medal', 'ForMemRS', 'Copley Medal', 'Gold Medal of the Royal Astronomical Society', 'Max Planck Medal', 'Member of the National Academy of Sciences', 'Time Person of the Century']
As of beautifulsoup4 version 4.10.0, th:contains is deprecated; use th:-soup-contains instead of th:contains:
awards = [a.text for a in soup.select('th:-soup-contains("Awards") + td a[title]')]
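To see how this selector behaves when the header is missing, here is a minimal sketch that runs it against a small hand-written HTML fragment instead of the live page. The fragment and the helper name `linked_texts_for_header` are made up for illustration, and the sketch assumes an installed soupsieve version that supports `:-soup-contains`. Because `select()` returns an empty list when nothing matches, a page without an Awards row needs no special handling.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a Wikipedia infobox; not the real page markup.
SAMPLE_HTML = """
<table class="infobox biography vcard">
  <tr><th>Born</th><td>23 June 1912</td></tr>
  <tr><th>Awards</th><td><a href="/wiki/X" title="Smith's Prize">Smith's Prize</a></td></tr>
</table>
"""

def linked_texts_for_header(html, header):
    """Return the text of every titled link in the <td> right after a matching <th>."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumes `header` contains no double quotes, since it is pasted into the selector.
    selector = f'th:-soup-contains("{header}") + td a[title]'
    return [a.get_text() for a in soup.select(selector)]

awards = linked_texts_for_header(SAMPLE_HTML, "Awards")
missing = linked_texts_for_header(SAMPLE_HTML, "Spouse")  # no such row
print(awards)
print(missing)
```

Note that `:-soup-contains` matches substrings, so a header such as "Awards and honours" would also match "Awards".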
Upvotes: 2
Reputation: 160
Here's how you can access the 'Awards' part. Hope this is helpful to you.
from bs4 import BeautifulSoup
import urllib.request
testurl = "https://en.wikipedia.org/wiki/Alan_Turing"
page = urllib.request.urlopen(testurl)
page_content = BeautifulSoup(page, "html.parser")
table = page_content.find('table', attrs={'class': 'infobox biography vcard'})
for link in table.find_all('th'):
    if link.text == 'Awards':
        # The awards are in the <td> that follows the matching <th>
        your_needed_variable = link.find_next_sibling('td').text
        print(your_needed_variable)
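On the original question of how to stop the loop when there is no 'Awards' header: a `for` loop over the rows ends on its own once the rows run out, so nothing has to break out of it by hand. The following is a rough sketch of that idea, not a definitive implementation. The helper name `cell_for_header` and the sample HTML are hypothetical, and the fragment is only a stand-in for the real infobox markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a Wikipedia infobox; not the real page markup.
SAMPLE_HTML = """
<table class="infobox biography vcard">
  <tr><th>Born</th><td>23 June 1912</td></tr>
  <tr><th>Awards</th><td><a href="/wiki/X" title="Smith's Prize">Smith's Prize</a></td></tr>
</table>
"""

def cell_for_header(table, header):
    """Return the <td> of the first row whose <th> text equals `header`, else None."""
    if table is None:  # the page has no matching table at all
        return None
    for tr in table.find_all("tr"):
        th = tr.find("th")
        if th is not None and th.get_text(strip=True) == header:
            return tr.find("td")
    return None  # loop finished without finding the header

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="infobox")
found = cell_for_header(table, "Awards")
absent = cell_for_header(table, "Spouse")
print(found.get_text(strip=True) if found is not None else "no Awards row")
print(absent)
```

Returning `None` lets the caller decide what to do when a page has no such row, instead of looping forever or raising an `AttributeError`.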
Upvotes: 0