Finding Tags And Text In BeautifulSoup

Question

I'm having some trouble formulating a findAll query for BeautifulSoup that'll do what I want. Previously, I was using findAll to extract only the text from some html, essentially stripping away all the tags. For example, if I had:

Cows are being abducted by aliens according to the <a>Washington Post</a>.

I initially thought that the following would do the trick:

''.join(html.findAll({'a':True}, text=True))

However, this doesn't work, since text=True seems to restrict the search to text only. What I need is some kind of OR option: I would like to find text OR tags. It's important that the tags stay around the text they are tagging; I can't have the tags or text appearing out of order.

Any thoughts?

Ocaj Nires · Accepted Answer

Note: BeautifulSoup.findAll is a search API. Its first named argument, name, restricts the search to a given set of tags. A single findAll call cannot select all the text between tags and, at the same time, select both the text and the tag for <a>. So I came up with the below solution.

This solution depends on BeautifulSoup.Tag being imported.

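from BeautifulSoup import BeautifulSoup, Tag

# Assumed input: the <a> element sits at the top level of the document.
soup = BeautifulSoup('Cows are being abducted by aliens according to the <a>Washington Post</a>.')
parsed_soup = u''

The loop walks soup.contents, which is part of the "navigating the parse tree" API.

for item in soup.contents:
    if type(item) is Tag and u'a' != item.name:
        # Any tag other than <a>: keep only the text inside it.
        parsed_soup += ''.join(item.findAll(text = True))
    else:
        # Plain text and <a> tags are copied through unchanged.
        parsed_soup += unicode(item)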


The order of the text is preserved.

 >>> print parsed_soup
 Cows are being abducted by aliens according to the <a>Washington Post</a>.
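
The loop above only inspects the top level of the document, so an <a> nested inside another tag would be stripped along with everything else. For readers on Python 3 with the bs4 package (an assumption; the answer targets the older BeautifulSoup 3 module), a minimal recursive sketch of the same idea might look like the following. The helper name strip_tags_except, the keep parameter, and the example.com URL are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def strip_tags_except(node, keep=("a",)):
    """Return the content under `node` as a string, keeping only tags named in `keep`.

    Hypothetical helper, not part of the bs4 API.
    """
    parts = []
    for child in node.children:
        if isinstance(child, Tag):
            if child.name in keep:
                # Kept tags are emitted whole, including attributes and inner markup.
                parts.append(str(child))
            else:
                # Other tags are dropped, but their children are still walked
                # so that a nested kept tag survives.
                parts.append(strip_tags_except(child, keep))
        elif isinstance(child, NavigableString):
            # Bare text is copied through in document order.
            parts.append(str(child))
    return "".join(parts)


# Illustrative input; the URL is a placeholder.
html = (
    "<p>Cows are <b>being abducted</b> by aliens according to the "
    '<a href="https://example.com/">Washington Post</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
result = strip_tags_except(soup)
print(result)
```

Because the children are visited in document order and joined at the end, the text and the kept tags should come out in the same order they appear in the input. Passing a different tuple as keep, for example ("a", "b"), would retain those tags as well.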
