Reputation: 3321
I'm trying to get a specific piece of text, "D1. AGE".
I'm using print(soup.find('tr',{'class':'subjectHeadRow'}).text).
However, this gives me the following text:
D1. AGE Universe: Total population Reference tables: B01001 B16001 B09020
What is the best way to get the text "D1. AGE" only?
<tr class='subjectHeadRow'><th colspan='7'>D1. AGE<a href='./charts.php?p=37&g=05000US36003|04000US36|01000US&c=1' target='_blank' title='Chart data'><img src='/apps/elements/images/chart.png' class='iconButton noPrint' alt=''/></a>
<p class='subjectMeta'>Universe: Total population</p>
<p class='subjectMeta'>Reference tables: <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B01001/0500000US36003|0400000US36|0100000US' target='_blank'>B01001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B16001/0500000US36003|0400000US36|0100000US' target='_blank'>B16001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B09020/0500000US36003|0400000US36|0100000US' target='_blank'>B09020</a> </p></th></tr>
Another question: I want to search an entire page and collect the class attribute of every td tag that has one. What would be the best way to achieve this? For instance, if my page has the tags below, I want to return the values
['indent0', 'value moeLow', 'value moeHigh']
<td class='indent0' title='TotPop'>Total population</td>
<td></td>
<td class='value moeLow' title='+/- 0.00% (47025, 47025)'>47,025</td>
<td class='value moeHigh' title='+/- 0.00% (19618452, 19618452)'>19,618,452</td>
<td></td>
Upvotes: 0
Views: 96
Reputation: 33384
To get the value D1. AGE, find the row, call find_next('th') to reach the header cell, then take contents[0], which is the cell's first child (the leading text node).
from bs4 import BeautifulSoup
html='''<tr class='subjectHeadRow'><th colspan='7'>D1. AGE<a href='./charts.php?p=37&g=05000US36003|04000US36|01000US&c=1' target='_blank' title='Chart data'><img src='/apps/elements/images/chart.png' class='iconButton noPrint' alt=''/></a>
<p class='subjectMeta'>Universe: Total population</p>
<p class='subjectMeta'>Reference tables: <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B01001/0500000US36003|0400000US36|0100000US' target='_blank'>B01001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B16001/0500000US36003|0400000US36|0100000US' target='_blank'>B16001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B09020/0500000US36003|0400000US36|0100000US' target='_blank'>B09020</a> </p></th></tr>'''
soup=BeautifulSoup(html,"html.parser")
print(soup.find('tr',{'class':'subjectHeadRow'}).find_next('th').contents[0])
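As a variation on the contents[0] approach, here is a minimal sketch that asks only for text nodes that are direct children of the th, then strips surrounding whitespace. The HTML is a trimmed copy of the row from the question (link targets are shortened placeholders), and the variable names th and heading are just illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the header row from the question (link targets shortened).
html = """<table><tr class='subjectHeadRow'><th colspan='7'>D1. AGE<a href='#' title='Chart data'><img src='chart.png' alt=''/></a>
<p class='subjectMeta'>Universe: Total population</p>
<p class='subjectMeta'>Reference tables: <a href='#'>B01001</a></p></th></tr></table>"""

soup = BeautifulSoup(html, "html.parser")
th = soup.find("tr", class_="subjectHeadRow").find("th")

# recursive=False limits the search to text nodes that are direct children of <th>,
# so the text inside the nested <p> and <a> tags is never considered.
heading = th.find(string=True, recursive=False).strip()
print(heading)  # D1. AGE
```

This may be a little more forgiving than contents[0] if the page ever puts a tag before the heading text, since it skips non-text children.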
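For the second example, use class_=True (note the trailing underscore, since class is a reserved word in Python) or the CSS selector td[class], then join each tag's class list into a single string.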
html='''<td class='indent0' title='TotPop'>Total population</td>
<td></td>
<td class='value moeLow' title='+/- 0.00% (47025, 47025)'>47,025</td>
<td class='value moeHigh' title='+/- 0.00% (19618452, 19618452)'>19,618,452</td>
<td></td> '''
soup=BeautifulSoup(html,"html.parser")
tds=[' '.join(td['class']) for td in soup.find_all('td' , class_=True)]
print(tds)
# Or use a CSS selector
tds=[' '.join(td['class']) for td in soup.select('td[class]')]
print(tds)
Output:
['indent0', 'value moeLow', 'value moeHigh']
['indent0', 'value moeLow', 'value moeHigh']
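On a full page the same class value will usually appear on many cells. If only the distinct values are wanted, one possible sketch is to deduplicate while keeping first-seen order. The extra row with 1,234 below is made up purely to show a repeated class, and the names class_values and unique_values are illustrative.

```python
from bs4 import BeautifulSoup

# Cells from the question plus one hypothetical extra cell that repeats a class.
html = """<table><tr>
<td class='indent0' title='TotPop'>Total population</td>
<td></td>
<td class='value moeLow'>47,025</td>
<td class='value moeHigh'>19,618,452</td>
<td class='value moeLow'>1,234</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# td[class] matches only <td> tags that carry a class attribute;
# td["class"] is a list of class names, so join it back into one string.
class_values = [" ".join(td["class"]) for td in soup.select("td[class]")]

# dict.fromkeys keeps the first occurrence of each value, in order.
unique_values = list(dict.fromkeys(class_values))
print(unique_values)  # ['indent0', 'value moeLow', 'value moeHigh']
```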
Upvotes: 1
Reputation: 84475
It looks like the th is a child element of the row, so use the child combinator > to select the th that is a direct child of the element with class subjectHeadRow. Then use stripped_strings and take index 0 to get the string of interest.
from bs4 import BeautifulSoup as bs
html = '''<table>
<tr class='subjectHeadRow'><th colspan='7'>D1. AGE<a href='./charts.php?p=37&g=05000US36003|04000US36|01000US&c=1' target='_blank' title='Chart data'><img src='/apps/elements/images/chart.png' class='iconButton noPrint' alt=''/></a>
<p class='subjectMeta'>Universe: Total population</p>
<p class='subjectMeta'>Reference tables: <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B01001/0500000US36003|0400000US36|0100000US' target='_blank'>B01001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B16001/0500000US36003|0400000US36|0100000US' target='_blank'>B16001</a> <a href='http://factfinder2.census.gov/bkmk/table/1.0/en/ACS/18_5YR/B09020/0500000US36003|0400000US36|0100000US' target='_blank'>B09020</a> </p></th></tr></table>
'''
soup = bs(html, 'lxml')
print([string for string in soup.select_one('.subjectHeadRow > th').stripped_strings][0])
Or, since stripped_strings is already a generator, call next() on it once:
gen = soup.select_one('.subjectHeadRow > th').stripped_strings
print(next(gen))
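If the row might be missing on some pages, the generator approach can be wrapped so it returns None instead of raising. This is only a sketch: first_heading_text is a hypothetical helper name, the sample HTML is a shortened version of the row above, and it uses the built-in html.parser rather than lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup


def first_heading_text(html, selector=".subjectHeadRow > th"):
    """Return the first non-empty text fragment under `selector`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.select_one(selector)
    if cell is None:
        # No matching element on this page.
        return None
    # The second argument to next() is returned if the generator is empty.
    return next(cell.stripped_strings, None)


sample = (
    "<table><tr class='subjectHeadRow'><th>D1. AGE"
    "<p class='subjectMeta'>Universe: Total population</p></th></tr></table>"
)
found = first_heading_text(sample)
missing = first_heading_text("<table><tr><th>Other</th></tr></table>")
print(found)    # D1. AGE
print(missing)  # None
```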
Upvotes: 1