Reputation: 41
I'm trying to parse some HTML that I've scraped and am running into an odd issue. I need to find a <td> tag that contains an <a> tag with a certain name, and then I want to dump the contents of the entire <td> tag. For now I'm just trying to get it to actually print the contents of the "name" attribute of the <a> tag. My understanding is that if I have a specific element (as opposed to a list of elements), the "attrs" of that element should be a dictionary, and I should be able to pull out the value via string key:
    soup = BeautifulSoup(html)
    for tdblock in soup.findAll('td'):
        try:
            for ablock in tdblock.findAll('a'):
                print ablock.attrs['name']
        except AttributeError:
            pass
(The try/except blocks are because not all the <td> blocks in the HTML have <a> blocks.)
But it throws a TypeError:
    Traceback (most recent call last):
      File "fetch_historic_nfl_odds.py", line 26, in <module>
        print ablock.attrs['name']
    TypeError: list indices must be integers, not str
And if I modify the code to just print ablock.attrs, it's clearly a list, not a dictionary:
[(u'name', u'EMAIL')]
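A possible workaround that stays on the older library, sketched under the assumption that attrs is always a list of (name, value) tuples like the output above: pass the list to dict() and look the attribute up by key. The variable names here are only for illustration.

```python
# attrs in the shape printed above: a list of (name, value) pairs
attrs = [(u'name', u'EMAIL')]

# dict() accepts any iterable of 2-tuples, so the list converts directly
attrs_by_name = dict(attrs)

# .get() returns None instead of raising when the attribute is missing
name = attrs_by_name.get('name')
missing = attrs_by_name.get('href')

print(name)
```

This removes the dependence on where the pair sits in the list. As far as I know, the old BeautifulSoup tags also allow dictionary-style lookup on the tag itself (ablock['name'] or ablock.get('name')), which would avoid touching attrs at all.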
I've seen some stuff on Stack Overflow indicating that you'll get a list if you try to parse the attributes of a findAll, but I'm going element by element, so it's unclear why that would be the case.
I've also tried modifying things so it uses find() to just get the first <a> item, but "attrs" is still a list.
Grabbing what I need by integer works, but I can't rely on the data I need always being at the same spot in the list. I know that I can use findAll to search for specific elements by the actual attribute, but I need to match only the first few words of the string in the name attribute, so I don't think that would work.
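For what it's worth, attribute filters are not limited to exact strings: a compiled regular expression can be used as the value, which covers the "first few words" case. The sketch below uses bs4 and its find_all (an assumption, since the code above uses the older library, whose findAll is documented to accept regexes as well) and a made-up HTML sample. The filter goes through the attrs dictionary because name= as a keyword argument is already taken by the tag name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modelled on the scraped page
html = """
<table>
 <tr>
  <td><a name="Closing NFL Odds Week 1, 2006"></a><b>Week One</b></td>
  <td><a name="EMAIL"></a>contact</td>
  <td>no anchor here</td>
 </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "^" anchors the pattern so only names that START with the prefix match
prefix = re.compile(r"^Closing NFL Odds")
anchors = soup.find_all("a", attrs={"name": prefix})

matching_names = [a["name"] for a in anchors]
# Walk up from each matching <a> to the <td> that encloses it
matching_cells = [a.find_parent("td") for a in anchors]

print(matching_names)
```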
EDIT: Here's a snippet of the HTML code I'm trying to parse, via soup.prettify():
<table width="644" border="0" cellpadding="3" cellspacing="0">
 <tr>
  <td>
   <br />
   <a name="Closing NFL Odds Week 1, 2006">
   </a>
   <center>
    <font face="Georgia, Times New Roman, Times, serif">
     <span style="font-size:14.0pt;font-family:Georgia">
      <b>
       Closing Las Vegas NFL Odds From Week 1, 2006
       <br />
       Week One NFL Football Odds
       <br />
       Pro Football Game Odds 9/7 - 9/11, 2006
      </b>
     </span>
    </font>
   </center>
What I'm looking for is to be able to check and see if that first <a> tag has a "name" attribute that starts with "Closing NFL Odds", and if it does, return the whole <td> block for additional parsing.
Further Edit: I'm using Python 2.7.12, and the non-bs4 BeautifulSoup, in case that's relevant.
Reputation: 41
jwodder had it right; BeautifulSoup versions prior to version 4 seem to return lists for the attributes. I upgraded to bs4 and now it works. Thanks, all!
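For anyone landing here later, this is roughly what the bs4 version can look like. It is a minimal sketch, not my exact script: the helper name find_odds_cells and the sample HTML are made up for illustration, and it assumes the built-in "html.parser" backend is good enough for the page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page
html = """
<table>
 <tr>
  <td><br /><a name="Closing NFL Odds Week 1, 2006"></a>
      <center><b>Week One NFL Football Odds</b></center></td>
  <td><a name="EMAIL"></a>contact</td>
  <td>no anchor here</td>
 </tr>
</table>
"""

def find_odds_cells(markup, prefix="Closing NFL Odds"):
    """Return every <td> whose first <a> has a name starting with prefix."""
    soup = BeautifulSoup(markup, "html.parser")
    cells = []
    for td in soup.find_all("td"):
        a = td.find("a")  # None when the cell contains no <a>
        if a is None:
            continue
        # In bs4, attrs is a dict, so a string key works;
        # .get() with a default also covers anchors without a name
        if a.attrs.get("name", "").startswith(prefix):
            cells.append(td)
    return cells

cells = find_odds_cells(html)
print(len(cells))
```

Checking for None replaces the try/except from the question, and each returned <td> is a full tag object that can be parsed further.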