Reputation: 13
Given a string that contains tags, how can I strip all of the tags from it? For example:
string = hello<tag1>there</tag1> I <tag2> want to </tag2> strip <tag3>all </tag3>these tags
>>>> hello there I want to strip all these tags
Upvotes: 1
Views: 258
Reputation: 134008
The text attribute is the most straightforward option, but it just concatenates the text nodes verbatim, so you get
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""hello<tag1>there</tag1> I <tag2> want to </tag2> strip <tag3>all </tag3>these tags""", "html.parser")
>>> soup.text
u'hellothere I  want to  strip all these tags'
You can collapse each run of whitespace into a single space with
>>> ' '.join(soup.text.split())
u'hellothere I want to strip all these tags'
Now, the missing space between 'hello' and 'there' is a tricky one: if <tag1> were <b>, user agents would render the text as hellothere, without any intervening space. To get this right in general, one needs to parse the CSS to know which elements are inline and which are not.
However, if we accept that every tag (opening or closing) is replaced by a space, a crude approach is to fetch all the text nodes with soup.findChildren, split each one on whitespace, merge the resulting lists with itertools.chain, and then join the words with a single space as the separator:
>>> from itertools import chain
>>> words = chain(*(i.split() for i in soup.findChildren(text=True)))
>>> ' '.join(words)
u'hello there I want to strip all these tags'
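If Beautiful Soup is not available, the same idea can be sketched with the standard library's html.parser instead. This is only an illustration, not the approach used above: the names TagStripper, strip_tags and inline_tags are made up for this example, and the inline_tags argument is a hand-written stand-in for the CSS knowledge mentioned earlier, not real CSS handling.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text nodes, treating each tag as a word break unless it is inline."""

    def __init__(self, inline_tags=()):
        super().__init__()
        # Assumption: a hand-written set of tag names stands in for CSS;
        # tags listed here do not separate words.
        self.inline_tags = frozenset(inline_tags)
        self.parts = []

    def _word_break(self, tag):
        # Tag names arrive lowercased from HTMLParser.
        if tag not in self.inline_tags:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._word_break(tag)

    def handle_endtag(self, tag):
        self._word_break(tag)

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup, inline_tags=()):
    parser = TagStripper(inline_tags)
    parser.feed(markup)
    parser.close()  # flush any text buffered after the last tag
    # Collapse every run of whitespace into a single space.
    return " ".join("".join(parser.parts).split())


sample = "hello<tag1>there</tag1> I <tag2> want to </tag2> strip <tag3>all </tag3>these tags"
print(strip_tags(sample))
print(strip_tags("hello<b>there</b>", inline_tags={"b"}))
```

With the default empty inline_tags, every tag acts as a word break, so the first call prints hello there I want to strip all these tags. The second call treats <b> as inline and prints hellothere, matching how a browser would render it.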
Upvotes: 2