cryptic_star

Reputation: 1873

Finding Tags And Text In BeautifulSoup

I'm having some trouble formulating a findAll query for BeautifulSoup that'll do what I want. Previously, I was using findAll to extract only the text from some html, essentially stripping away all the tags. For example, if I had:

<b>Cows</b> are being abducted by aliens according to the
<a href="www.washingtonpost.com">Washington Post</a>.

It would be reduced to:

Cows are being abducted by aliens according to the Washington Post.

I would do this by using ''.join(html.findAll(text=True)). This was working great, until I decided I would like to keep only the <a> tags, but strip the rest of the tags away. So, given the initial example, I would end up with this:

Cows are being abducted by aliens according to the
<a href="www.washingtonpost.com">Washington Post</a>.

I initially thought that the following would do the trick:

''.join(html.findAll({'a':True}, text=True))

However, this doesn't work, since text=True seems to restrict the search to text only. What I need is some kind of OR option: I would like to find text OR <a> tags. It's important that the tags stay around the text they are tagging; I can't have the tags or text appearing out of order.

Any thoughts?

Upvotes: 3

Views: 4330

Answers (1)

Ocaj Nires

Reputation: 3355

Note: BeautifulSoup.findAll is a search API. Its first argument, name, restricts the search to a given set of tags, while text=True restricts it to text nodes. A single findAll call cannot return both the bare text between tags and the <a> tags together with their text, so I came up with the solution below.

This solution depends on BeautifulSoup.Tag being imported.

from BeautifulSoup import BeautifulSoup, Tag

soup = BeautifulSoup('<b>Cows</b> are being abducted by aliens according to the <a href="www.washingtonpost.com">Washington Post</a>.')
parsed_soup = ''

We walk the top level of the parse tree like a list, using the contents attribute. When an item is a tag other than <a>, we keep only its text. Otherwise (plain text or an <a> tag) we keep the item's full string, markup included. This uses the parse-tree navigation API rather than the search API.

for item in soup.contents:
    if isinstance(item, Tag) and item.name != u'a':
        # Any tag except <a>: keep only the text inside it.
        parsed_soup += ''.join(item.findAll(text=True))
    else:
        # Plain text or an <a> tag: keep it unchanged.
        parsed_soup += unicode(item)

The order of the text is preserved.

 >>> parsed_soup
 u'Cows are being abducted by aliens according to the <a href="www.washingtonpost.com">Washington Post</a>.'
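
The loop above only looks at top-level items, so an <a> nested inside another tag would lose its markup. As a rough sketch of one way around that, assuming Python 3 and swapping BeautifulSoup for the standard library's html.parser, a streaming parser can emit text and <a> tags in document order and drop every other tag. The names KeepAnchors and keep_anchors are made up for this illustration and are not part of any library.

```python
from html.parser import HTMLParser


class KeepAnchors(HTMLParser):
    """Hypothetical helper: collect text and <a> tags, drop all other tags."""

    def __init__(self):
        # convert_charrefs=False so entities reach the handlers below
        # instead of being decoded into plain characters.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # get_starttag_text() returns the start tag as written in the input.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a':
            self.parts.append('</a>')

    def handle_data(self, data):
        # Text nodes are always kept, in the order they appear.
        self.parts.append(data)

    def handle_entityref(self, name):
        # Re-emit named entities such as &amp; unchanged.
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        # Re-emit numeric entities such as &#169; unchanged.
        self.parts.append('&#%s;' % name)


def keep_anchors(html):
    """Return html with every tag except <a> stripped (illustrative only)."""
    parser = KeepAnchors()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


print(keep_anchors('<p>Cows, says <i>the</i> <a href="x"><b>Post</b></a>.</p>'))
```

Because the parser sees a flat stream of events, nesting depth does not matter. Comments are dropped as well, while the contents of <script> and <style> come through as text, which may or may not be what you want.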

Upvotes: 4
