Reputation: 1873
I'm having some trouble formulating a findAll query for BeautifulSoup that'll do what I want. Previously, I was using findAll to extract only the text from some HTML, essentially stripping away all the tags. For example, if I had:
<b>Cows</b> are being abducted by aliens according to the
<a href="www.washingtonpost.com">Washington Post</a>.
It would be reduced to:
Cows are being abducted by aliens according to the Washington Post.
I would do this by using ''.join(html.findAll(text=True)). This was working great, until I decided I would like to keep only the <a> tags, but strip the rest of the tags away. So, given the initial example, I would end up with this:
Cows are being abducted by aliens according to the
<a href="www.washingtonpost.com">Washington Post</a>.
I initially thought that the following would do the trick:
''.join(html.findAll({'a':True}, text=True))
However, this doesn't work, since text=True seems to indicate that it will only find text. What I need is some kind of OR option: I would like to find text OR <a> tags. It's important that the tags stay around the text they are tagging; I can't have the tags or text appearing out of order.
Any thoughts?
Upvotes: 3
Views: 4330
Reputation: 3355
Note: BeautifulSoup.findAll is a search API. Its first named argument, name, can be used to restrict the search to a given set of tags. A single findAll call cannot select all the text between tags and, at the same time, keep both the text and the tag for <a>. So I came up with the solution below.
This solution depends on BeautifulSoup.Tag being imported.
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup('<b>Cows</b> are being abducted by aliens according to the <a href="www.washingtonpost.com">Washington Post</a>.')
parsed_soup = ''
We navigate the parse tree like a list through the contents attribute. For each item, we extract only the text when the item is a tag other than <a>. Otherwise (plain text, or an <a> tag) we take the whole item as a string, tag included. This uses the parse-tree navigation API.
for item in soup.contents:
    # Any tag other than <a>: keep only the text inside it.
    if type(item) is Tag and u'a' != item.name:
        parsed_soup += ''.join(item.findAll(text = True))
    # Plain text or an <a> tag: keep it verbatim.
    else:
        parsed_soup += unicode(item)
The order of the text is preserved.
>>> parsed_soup
u'Cows are being abducted by aliens according to the <a href="www.washingtonpost.com">Washington Post</a>.'
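The loop over soup.contents only looks at top-level children, so an <a> nested inside another tag would lose its markup. As a rough sketch of the same "keep text and <a>, drop everything else" idea for arbitrarily nested input, here is a variant that swaps BeautifulSoup for Python 3's standard-library html.parser. The names KeepAnchors, keep and strip_tags_except_links are made up for illustration and are not part of any library; attribute values are re-quoted with double quotes, which is an assumption about the output format you want.

```python
from html import escape
from html.parser import HTMLParser


class KeepAnchors(HTMLParser):
    """Collect text and selected tags in document order; drop all other tags."""

    def __init__(self, keep=("a",)):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)  # hypothetical parameter: tag names to preserve
        self.parts = []        # output fragments, in the order they were seen

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            # Re-emit the tag with its attributes; a valueless attribute
            # (value is None) is written with an empty string.
            rendered = "".join(
                ' %s="%s"' % (name, escape(value or "", quote=True))
                for name, value in attrs
            )
            self.parts.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text arrives with entities already decoded, so escape it again.
        self.parts.append(escape(data, quote=False))


def strip_tags_except_links(markup):
    parser = KeepAnchors()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


sample = ('<p><b>Cows</b> are being abducted by aliens according to the '
          '<a href="www.washingtonpost.com">Washington <i>Post</i></a>.</p>')
print(strip_tags_except_links(sample))
```

Because the parser reports start tags, text and end tags as it encounters them, the text and the <a> tags come out in their original order without any explicit tree walk.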
Upvotes: 4