Split HTML (or XML) node text by tags

Question

I have some HTML that looks like this:


Bla bla bla <b>bold stuff</b> Bla bla.
But somewhere else the words bold stuff may appear not in bold

I would like to parse this text to extract the bold elements, and the non-bolded elements as separate lists:

bolds = ['bold stuff']
normal_text = [
    'Bla bla bla ',
    'Bla bla.\nBut somewhere else the words bold stuff may appear not in bold'
]

I may be being stupid, but I can't figure out how to do this using "standard" HTML parsers.

I can extract the full text of the element, including the bolds, and I can extract the bolds themselves, but I can't work out what the text before and after each bold is, because the same string may also appear elsewhere in the text without being bold.

I'm using lxml, but I'm willing to consider solutions with other parsers, or any clever XPath selectors I don't know about...

Otherwise, I'm about to resort to regular expressions, which, as we all know, will be the end of the world.

Can someone save the Earth before it's too late?
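
One possible approach, sketched here with Python's standard-library html.parser rather than lxml (the question allows other parsers): read the markup as a stream of events and keep track of whether a bold tag is currently open, so each run of text is filed by its position and not by its content. The names BOLD_TAGS, BoldSplitter and split_by_bold are made up for this illustration, and treating <b> and <strong> as the bold tags is an assumption.

```python
from html.parser import HTMLParser

# Assumption: these are the tags that count as "bold".
BOLD_TAGS = {"b", "strong"}


class BoldSplitter(HTMLParser):
    """Collect text runs into bold and non-bold lists (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.bolds = []
        self.normal = []
        self._depth = 0  # number of bold tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in BOLD_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in BOLD_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Text seen while a bold tag is open is bold; everything else is normal.
        target = self.bolds if self._depth else self.normal
        target.append(data)


def split_by_bold(markup):
    parser = BoldSplitter()
    parser.feed(markup)
    parser.close()  # flush any text still buffered after the last tag
    return parser.bolds, parser.normal


sample = (
    "<p>Bla bla bla <b>bold stuff</b> Bla bla.\n"
    "But somewhere else the words bold stuff may appear not in bold</p>"
)
bolds, normal_text = split_by_bold(sample)
print(bolds)        # ['bold stuff']
print(normal_text)
```

Because the text is classified by where it sits, the second, unbolded "bold stuff" stays in the normal list. Whitespace is kept exactly as it appears in the markup, so the run after the bold starts with a space. In lxml, the XPath expressions .//text()[ancestor::b] and .//text()[not(ancestor::b)] should give an equivalent split, although that variant is not shown here.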
