Reputation: 421
I'm trying to grab the string immediately after the opening <td> tag. The following code works:
webpage = urlopen(i).read()
soup = BeautifulSoup(webpage)
for elem in soup('td', text=re.compile(r".\.doc")):
    print elem.parent
when the html looks like this:
<td>plan_49913.doc</td>
but not when the html looks like this:
<td>plan_49913.doc<br />
<font color="#990000">Document superseded by:  </font><a href="/plans/Jan_2012.html">January 2012</a></td>
I've tried playing with attrs but can't get it to work. Basically, I just want to grab 'plan_49913.doc' from either version of the HTML.
Any advice would be greatly appreciated.
Thank you in advance.
~chrisK
Upvotes: 0
Views: 2104
Reputation: 1823
Just use the next property. It holds the node that immediately follows the tag, which in this case is the text node.
>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by:  </font><a href="/plans/Jan_2012.html">January 2012</a></td>'
>>> bs = BeautifulSoup(html)
>>> texts = [ node.next for node in bs.findAll('td') if node.next.endswith('.doc') ]
>>> texts
[u'plan_49913.doc']
You can change the if clause to use a regex if you prefer.
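For readers on a newer setup, here is a rough sketch of the same idea with the if clause switched to a regex. It assumes Python 3 and the bs4 package (BeautifulSoup 4) rather than the older version used in this thread, and the function name leading_doc_names is made up for illustration. It reads td.contents[0] instead of next, so an empty cell cannot pick up a node from outside the cell, and it skips cells whose first child is a tag rather than text.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Matches text that ends in ".doc" (assumption: that is the filename pattern wanted).
DOC_RE = re.compile(r"\.doc$")

def leading_doc_names(html):
    """Return the text that directly follows each opening <td>, if it ends in .doc."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for td in soup.find_all("td"):
        # First child of the cell, or None when the cell is empty.
        first = td.contents[0] if td.contents else None
        # Only text nodes are candidates; a leading <b>, <a>, etc. is skipped.
        if isinstance(first, NavigableString):
            text = first.strip()
            if DOC_RE.search(text):
                names.append(text)
    return names
```

Called on either version of the HTML from the question, this should return a list containing only the filename string, not the surrounding cell.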
Upvotes: 0
Reputation: 40384
This works for me:
>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by:  </font><a href="/plans/Jan_2012.html">January 2012</a></td>'
>>> soup = BeautifulSoup(html)
>>> soup.find(text=re.compile('.\.doc'))
u'plan_49913.doc'
Is there something I'm missing?
Also, note that according to the documentation:
If you use text, then any values you give for name and the keyword arguments are ignored.
So you don't need to pass 'td': it is ignored anyway, which means matching text under any other tag will be returned as well.
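If matches under other tags are a problem, one option is to search on the text alone and then filter by parent explicitly. The sketch below assumes Python 3 and bs4 (BeautifulSoup 4), where, as far as I know, combining a tag name with a string filter behaves differently from the documentation quoted above, so it avoids combining them at all. The function name doc_names_in_cells is hypothetical.

```python
import re
from bs4 import BeautifulSoup

def doc_names_in_cells(html):
    """Return .doc text nodes, keeping only those whose direct parent is a <td>."""
    soup = BeautifulSoup(html, "html.parser")
    # Search text nodes only; no tag name is passed here.
    matches = soup.find_all(string=re.compile(r"\.doc\s*$"))
    # Restrict to table cells by checking each text node's parent tag.
    return [
        s.strip()
        for s in matches
        if s.parent is not None and s.parent.name == "td"
    ]
```

The parent check is what does the work the 'td' argument was expected to do: a matching string inside a <p> or <a> is found by the search but dropped by the filter.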
Upvotes: 1