drop tags in python

Question

Given some strings, how can I drop all the tags? For example:

string = hellothere I  want to  strip all these tags
>>>> hello there I want to strip all these tags

Antti Haapala -- Слава Україні · Accepted Answer

The text attribute is the most straightforward one, but it just copies the text nodes verbatim, thus you get

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""hellothere I  want to  strip all these tags""")
>>> soup.text
u'hellothere I  want to  strip all these tags'

You can squeeze all whitespace with

>>> ' '.join(soup.text.split())
u'hellothere I want to strip all these tags'

Now, the missing space between 'hello' and 'there' is a tricky one: if the tags in question were inline elements, the text would be rendered by user agents as hellothere, without any intervening space. One needs to parse CSS to know which elements are supposed to be inline and which are not.

However, if we allow each non-text node (and each closing tag) to be replaced by a space, a crude approach is to find all text nodes separately with soup.findChildren, split each of them separately, merge the resulting lists with itertools.chain, and then join all the words with a single space as the separator:

>>> from itertools import chain
>>> words = chain(*(i.split() for i in soup.findChildren(text=True)))
>>> ' '.join(words)
u'hello there I want to strip all these tags'
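
For readers without BeautifulSoup installed, the same "every tag acts as a word boundary" idea can be sketched with the standard library's html.parser in Python 3. This is not the approach used above, only a rough equivalent. The names TextCollector and strip_tags are hypothetical, and the tags in the sample string are made up for illustration, since the original markup is not shown here.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the words of every text node and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []

    def handle_data(self, data):
        # Each text node is split on its own, so a tag boundary
        # always ends up acting as a word boundary.
        self.words.extend(data.split())


def strip_tags(markup):
    # Hypothetical helper: feed the markup, then join the collected words.
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return ' '.join(parser.words)


# Assumed sample input; the actual tags in the question may differ.
sample = '<b>hello<i>there</i></b> I <span> want to </span> strip all these tags'
print(strip_tags(sample))
```

As with the findChildren version, this inserts a space at every tag boundary regardless of whether the element would be rendered inline, so it is only appropriate when that crude rule is acceptable.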
