Reputation: 185
Sorry, kind of a beginner question about BeautifulSoup, but I can't find the answer.
I'm having trouble figuring out how to scrape HTML tags without attributes.
Here's the relevant section of the HTML:
<tr bgcolor="#ffffff">
<td>
No-Lobbying List
</td>
<tr bgcolor="#efefef">
<td rowspan="2" valign="top">
6/24/2019
</td>
<td>
<a href="document.cfm?id=322577" target="_blank">
Brian Manley, Chief of Police, Austin Police Department
</a>
<a href="document.cfm?id=322577" target="_blank">
<img alt="Click here to download the PDF document" border="0" height="16" src="https://assets.austintexas.gov/edims/images/pdf_icon.gif" width="16"/>
</a>
</td>
<tr bgcolor="#efefef">
<td>
Preliminary 2018 Annual Crime Report - Executive Summary
</td>
</tr>
</tr>
</tr>
How can I navigate to the tag with the text "Preliminary 2018 Annual Crime Report - Executive Summary"?
I have tried moving from a tag with an attribute and using .next_sibling, but I've failed miserably.
Thank you.
trgrewy = soup.findAll('tr', {'bgcolor':'#efefef'})  # the cells alternate colors
trwhite = soup.findAll('tr', {'bgcolor':'#ffffff'})
trs = trgrewy + trwhite  # merge them into a list
for item in trs:
    mdate = item.find('td', {'rowspan':'2'})  # find if it's today's date
    if mdate:
        datetime_object = datetime.strptime(mdate.text, '%m/%d/%Y')
        if datetime_object.date() == now.date():
            sender = item.find('a').text
            pdf = item.find('a')['href']
            link = baseurl + pdf
            title = item.findAll('td')[2]  # this is where I've failed
Upvotes: 0
Views: 49
Reputation: 195573
You can use CSS selectors:
data = '''
<tr bgcolor="#ffffff">
<td>
No-Lobbying List
</td>
<tr bgcolor="#efefef">
<td rowspan="2" valign="top">
6/24/2019
</td>
<td>
<a href="document.cfm?id=322577" target="_blank">
Brian Manley, Chief of Police, Austin Police Department
</a>
<a href="document.cfm?id=322577" target="_blank">
<img alt="Click here to download the PDF document" border="0" height="16" src="https://assets.austintexas.gov/edims/images/pdf_icon.gif" width="16"/>
</a>
</td>
<tr bgcolor="#efefef">
<td>
Preliminary 2018 Annual Crime Report - Executive Summary
</td>
</tr>
</tr>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
# This will find date
print(soup.select_one('td[rowspan="2"]').get_text(strip=True))
# This will find next row after the row with date
print(soup.select_one('tr:has(td[rowspan="2"]) + tr').get_text(strip=True))
Prints:
6/24/2019
Preliminary 2018 Annual Crime Report - Executive Summary
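To tie this back to the loop in the question, here is a minimal sketch rather than a definitive implementation. It assumes the title always sits in the first <tr> that follows the date cell in document order. The names `BASE_URL`, `SAMPLE_HTML`, and `find_memos` are made up for illustration (the question's real `baseurl` isn't shown). It uses `find_next('tr')` instead of the `+` sibling selector, so it should not depend on whether the parser turns the unclosed <tr> tags into siblings (as lxml does) or leaves them nested (as html.parser does):

```python
from datetime import date, datetime

from bs4 import BeautifulSoup

# Hypothetical base URL; the question's real `baseurl` is not shown.
BASE_URL = "https://example.com/"

# Trimmed copy of the markup from the question, nested <tr> tags included.
SAMPLE_HTML = """
<tr bgcolor="#ffffff">
<td>No-Lobbying List</td>
<tr bgcolor="#efefef">
<td rowspan="2" valign="top">6/24/2019</td>
<td><a href="document.cfm?id=322577" target="_blank">Brian Manley, Chief of Police, Austin Police Department</a></td>
<tr bgcolor="#efefef">
<td>Preliminary 2018 Annual Crime Report - Executive Summary</td>
</tr>
</tr>
</tr>
"""


def find_memos(html, target_date, base_url=BASE_URL):
    """Return sender, link and title for every memo dated target_date."""
    soup = BeautifulSoup(html, "html.parser")
    memos = []
    # Each memo starts with a date cell that spans two rows.
    for date_cell in soup.select('td[rowspan="2"]'):
        memo_date = datetime.strptime(
            date_cell.get_text(strip=True), "%m/%d/%Y"
        ).date()
        if memo_date != target_date:
            continue
        row = date_cell.find_parent("tr")        # row holding date + sender
        sender_link = row.find("a")              # first link is the sender
        title_row = date_cell.find_next("tr")    # next <tr> in document order
        memos.append({
            "sender": sender_link.get_text(strip=True),
            "link": base_url + sender_link["href"],
            "title": title_row.get_text(strip=True) if title_row else None,
        })
    return memos


for memo in find_memos(SAMPLE_HTML, date(2019, 6, 24)):
    print(memo["sender"], memo["link"], memo["title"], sep="\n")
```

In the real script you would pass `datetime.now().date()` as `target_date`, as the question's loop does.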
Upvotes: 1
Reputation: 395
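I think you should try this:
from bs4 import BeautifulSoup
page = BeautifulSoup(HTML_TEXT, 'html.parser')
# Collect the text nodes that are direct children of the first <td>
text = page.find('td').find_all(string=True, recursive=False)
for i in text:
    print(i)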
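Note that `find('td')` returns only the first cell in the document. Since the question asks about tags without attributes, another option is to pass a function to `find_all` as the filter. This is only a sketch under assumptions: the helper name `is_plain_td` and the shortened sample markup are made up for illustration, and it treats "the title cell" as any <td> with no attributes and no child tags:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the markup from the question.
html = """
<table>
<tr bgcolor="#efefef">
<td rowspan="2" valign="top">6/24/2019</td>
<td><a href="document.cfm?id=322577">Brian Manley</a></td>
</tr>
<tr bgcolor="#efefef">
<td>Preliminary 2018 Annual Crime Report - Executive Summary</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def is_plain_td(tag):
    """True for a <td> that has no attributes and contains no child tags."""
    return tag.name == "td" and not tag.attrs and tag.find(True) is None


# find_all calls the function on every tag and keeps those returning True.
titles = [cell.get_text(strip=True) for cell in soup.find_all(is_plain_td)]
print(titles)
```

On the full page this would also match other attribute-free cells such as "No-Lobbying List", so it probably needs to be combined with the date check from the question.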
Upvotes: 0