Edpy

Reputation: 13

Drop tags in Python

Given a string, how can I drop all the tags? For example:

string = "hello<tag1>there</tag1> I <tag2> want to </tag2> strip <tag3>all </tag3>these tags"
# expected: hello there I want to strip all these tags

Upvotes: 1

Views: 258

Answers (1)

The text attribute is the simplest option, but it just concatenates the text nodes verbatim, so you get:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""hello<tag1>there</tag1> I <tag2> want to </tag2> strip <tag3>all </tag3>these tags""", "html.parser")
>>> soup.text
u'hellothere I  want to  strip all these tags'

You can collapse each run of whitespace into a single space with:

>>> ' '.join(soup.text.split())
u'hellothere I want to strip all these tags'

The missing space between 'hello' and 'there' is trickier. If <tag1> were an inline element such as <b>, user agents would render the text as hellothere, with no intervening space. Deciding correctly where a space belongs requires parsing the CSS to know which elements are inline and which are not.

However, if we accept that every tag (opening or closing) is replaced by a space, a crude approach is to collect all text nodes with soup.findChildren(text=True), split each one on whitespace, merge the resulting lists with itertools.chain, and join the words with a single space:

>>> from itertools import chain
>>> words = chain(*(i.split() for i in soup.findChildren(text=True)))
>>> ' '.join(words)
u'hello there I want to strip all these tags'

Upvotes: 2
