Parsing and Storing HTML Tags Along With Text

Question

I'm looking for a way to parse Unicode strings of HTML and split them into their elements (HTML tags as well as individual word tokens), storing the result in a list. BeautifulSoup has some nice functionality for parsing HTML, such as the .get_text() method, but that doesn't preserve the tags themselves.

What I need is something like this. Given an HTML Unicode string such as

s = u'<b>This is some important text!</b>'

what I would like to have as a result is a list like this:

['<b>', 'This', 'is', 'some', 'important', 'text!', '</b>']

There must be an easy way to do this with BeautifulSoup that I'm just not seeing in SO searches. Thanks for reading.

EDIT: Since there have been some questions about the purpose of storing the tags: I'm interested in using the tags as features for a text classification project. I'm experimenting with different structural features from an online discussion forum in addition to the n-grams present within forum posts.

mhawke · Accepted Answer

It's a bit of an odd requirement, so here's an odd but simple solution:

from bs4 import BeautifulSoup

s = u'<b>This is some important text!</b>'
soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly

>>> soup.b.prettify().split()
[u'<b>', u'This', u'is', u'some', u'important', u'text!', u'</b>']
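The prettify().split() trick relies on splitting at whitespace, so a tag that carries attributes (for example <a href="x">) would be broken into several tokens. As a rough alternative sketch, not part of the original answer, the same list can be built with Python 3's standard-library html.parser instead of BeautifulSoup. The names TagAndWordTokenizer and tokenize_html are hypothetical, and dropping attributes from the emitted tags is an assumption about what is useful as a feature.

```python
from html.parser import HTMLParser


class TagAndWordTokenizer(HTMLParser):
    """Collect tags and whitespace-separated words in document order.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Assumption: attributes are dropped so each tag is a single token.
        self.tokens.append('<%s>' % tag)

    def handle_endtag(self, tag):
        self.tokens.append('</%s>' % tag)

    def handle_data(self, data):
        # Split text between tags on whitespace, like str.split().
        self.tokens.extend(data.split())


def tokenize_html(markup):
    """Return a flat list of tag tokens and word tokens."""
    parser = TagAndWordTokenizer()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.tokens


print(tokenize_html('<b>This is some important text!</b>'))
```

Because the parser reports each tag and each text run separately, nested markup such as '<p>Hello <i>big</i> world</p>' comes out in document order without any special handling.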
