user1117603

Reputation: 421

Beautiful Soup - Grabbing the string after the first specified tag

I'm trying to grab the string immediately after the opening <td> tag. The following code works:

webpage = urlopen(i).read()
soup = BeautifulSoup(webpage)
for elem in soup('td', text=re.compile(".\.doc")):
    print elem.parent

when the html looks like this:

<td>plan_49913.doc</td>

but not when the html looks like this:

<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>

I've tried playing with attrs but can't get it to work. Basically, I just want to grab 'plan_49913.doc' from either version of the HTML.

Any advice would be greatly appreciated.

Thank you in advance.

~chrisK

Upvotes: 0

Views: 2104

Answers (2)

Giacomo Lacava

Reputation: 1823

Just use the next property. It holds the node that immediately follows the opening tag, and in this case that node is a text node.

>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>'
>>> bs = BeautifulSoup(html)
>>> texts = [ node.next for node in bs.findAll('td') if node.next.endswith('.doc') ]
>>> texts
[u'plan_49913.doc']

You can change the if clause to use a regex if you prefer.
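For anyone on the newer bs4 package with Python 3, here is a minimal sketch of the same idea. It assumes bs4 is installed and uses the stdlib "html.parser" backend. The function name leading_doc_names and the pattern DOC_NAME are made up for illustration and are not part of Beautiful Soup. It also checks that the first child really is a text node before testing it, so a cell that starts with a tag is simply skipped.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Hypothetical pattern: the leading text must end in ".doc".
DOC_NAME = re.compile(r".\.doc$")

def leading_doc_names(html):
    """Return the text that directly follows each opening <td>, if it looks like a .doc name."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for td in soup.find_all("td"):
        # First child of the cell, or None for an empty cell.
        first = td.contents[0] if td.contents else None
        # Only text nodes qualify; a leading <br> or <a> is ignored.
        if isinstance(first, NavigableString):
            text = first.strip()
            if DOC_NAME.search(text):
                names.append(text)
    return names

simple = "<td>plan_49913.doc</td>"
nested = (
    '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font>'
    '<a href="/plans/Jan_2012.html">January 2012</a></td>'
)

print(leading_doc_names(simple))
print(leading_doc_names(nested))
```

Both calls should give the same one-element list, since only the first child of each cell is inspected.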

Upvotes: 0

jcollado

Reputation: 40384

This works for me:

>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>'
>>> soup = BeautifulSoup(html)
>>> soup.find(text=re.compile('.\.doc'))
u'plan_49913.doc'

Is there something I'm missing?

Also, note that according to the documentation:

If you use text, then any values you give for name and the keyword arguments are ignored.

So you don't need to pass 'td', since it is being ignored anyway. In other words, any matching text will be returned, whichever tag it sits under.
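Note that the quoted behaviour is from the BeautifulSoup 3 documentation. As far as I know, the newer bs4 package treats a tag name combined with a text filter differently: it returns tags whose .string matches, and a tag with several children has .string set to None. The sketch below assumes bs4 4.4 or later (where the argument is spelled string=) and the stdlib "html.parser" backend. The variable names are only illustrative.

```python
import re
from bs4 import BeautifulSoup

pattern = re.compile(r".\.doc")

simple = BeautifulSoup("<td>plan_49913.doc</td>", "html.parser")
nested = BeautifulSoup(
    '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font>'
    '<a href="/plans/Jan_2012.html">January 2012</a></td>',
    "html.parser",
)

# Searching by string alone returns the matching text node in both documents.
text_simple = simple.find(string=pattern)
text_nested = nested.find(string=pattern)
print(text_simple, text_nested)

# Adding a tag name asks for <td> tags whose .string matches the pattern.
tds_simple = simple.find_all("td", string=pattern)
# The nested <td> has several children, so its .string is None.
tds_nested = nested.find_all("td", string=pattern)
print(len(tds_simple), len(tds_nested))
```

So under bs4, searching by string alone and then using .parent if the enclosing cell is needed looks like the more robust route for both kinds of markup.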

Upvotes: 1
