Reputation: 1569
I'm looking for a way to parse unicode strings of HTML, split the string into its elements (HTML tags as well as individual word tokens), and store them in a list. BeautifulSoup obviously has some nice functionality for parsing HTML, such as the .get_text method, but this doesn't preserve the tags themselves.
What I need is something like this. Given an HTML unicode string such as
s = u'<b>This is some important text!</b>'
what I would like to have as a result is a list like this:
['<b>', 'This', 'is', 'some', 'important', 'text!', '</b>']
There must be an easy way to do this with BeautifulSoup that I'm just not seeing in SO searches. Thanks for reading.
EDIT: Since there have been some questions about the purpose of storing the tags: I'm interested in using the tags as features for a text classification project. I'm experimenting with different structural features from an online discussion forum, in addition to the n-grams present within forum posts.
Upvotes: 2
Views: 468
Reputation: 87074
It's a bit of an odd requirement, so here's an odd but simple solution:
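>>> from bs4 import BeautifulSoup
>>> s = u'<b>This is some important text!</b>'
>>> # Name the parser explicitly; otherwise bs4 picks one and emits a warning.
>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.b.prettify().split()
[u'<b>', u'This', u'is', u'some', u'important', u'text!', u'</b>']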
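One caveat: because the result is split on whitespace, a tag that carries attributes (say, <a href="x">) would be broken into several pieces. If whole tags are wanted as single tokens, a rough alternative is a regular expression over the raw string. This is only a sketch that swaps BeautifulSoup for the standard-library re module; the helper name tokenize_html is made up for illustration, and the pattern assumes simple markup (no comments, no script blocks, no '>' inside attribute values).

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# First alternative: a tag, i.e. everything from '<' up to the next '>'.
# Second alternative: a word, i.e. a run of characters that are neither
# whitespace nor '<'.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize_html(html):
    """Return tags and whitespace-separated words in document order."""
    return TOKEN_RE.findall(html)


if __name__ == "__main__":
    print(tokenize_html(u'<b>This is some important text!</b>'))
    # A tag with an attribute stays in one piece.
    print(tokenize_html(u'<a href="x">link text</a>'))
```

For anything beyond simple markup, a real parser is the safer choice; the regex only works here because the input is assumed to be well-behaved.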
Upvotes: 3