kylerthecreator

Reputation: 1569

Parsing and Storing HTML Tags Along With Text

I'm looking for a way to parse Unicode strings of HTML and split them into their elements (HTML tags as well as individual word tokens), storing the result in a list. BeautifulSoup has some nice functionality for parsing HTML, such as the .get_text() method, but that doesn't preserve the tags themselves.

What I need is something like this. Given an html unicode string such as

s = u'<b>This is some important text!</b>'

what I would like to have as a result is a list like this:

['<b>', 'This', 'is', 'some', 'important', 'text!', '</b>']

There must be an easy way to do this with BeautifulSoup that I'm just not seeing in SO searches. Thanks for reading.

EDIT: Since there have been some questions about the purpose of storing the tags: I want to use them as features in a text classification project. I'm experimenting with structural features from an online discussion forum in addition to the n-grams present within forum posts.

Upvotes: 2

Views: 468

Answers (1)

mhawke

Reputation: 87074

It's a bit of an odd requirement, so here's an odd but simple solution:

from bs4 import BeautifulSoup

s = u'<b>This is some important text!</b>'
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(s, 'html.parser')

>>> soup.b.prettify().split()
[u'<b>', u'This', u'is', u'some', u'important', u'text!', u'</b>']
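One caveat with the split() trick: a tag that carries attributes, such as <a href="x" class="y">, contains whitespace and would be broken into several tokens. If that matters for your features, here is a rough sketch of an alternative that uses the standard library's html.parser instead of BeautifulSoup. The names TagTokenizer and tokenize_html are made up for this example, and it assumes you only want the bare tag name, with attributes dropped.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Hypothetical helper: collects tags and words in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Assumption: keep only the tag name, so attributes never split a token.
        self.tokens.append('<%s>' % tag)

    def handle_endtag(self, tag):
        self.tokens.append('</%s>' % tag)

    def handle_data(self, data):
        # Text between tags is split on whitespace into word tokens.
        self.tokens.extend(data.split())


def tokenize_html(html):
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.tokens


print(tokenize_html('<b>This is some important text!</b>'))
# ['<b>', 'This', 'is', 'some', 'important', 'text!', '</b>']
```

If the attributes themselves are useful as features, handle_starttag receives them as a list of (name, value) pairs in attrs, so you could build a richer token there instead of discarding them.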

Upvotes: 3
