hwjp

Reputation: 16071

Split HTML (or XML) node text by tags

I have some HTML that looks like this:

<div>
Bla bla bla <b>bold stuff</b> Bla bla.
But somewhere else the words bold stuff may appear not in bold
</div>

I would like to parse this text to extract the bold elements, and the non-bolded elements as separate lists:

bolds = ['bold stuff']
normal_text = [
    'Bla bla bla ',
    'Bla bla.\nBut somewhere else the words bold stuff may appear not in bold'
]

I may be being stupid, but I can't figure out how to do this using "standard" HTML parsers.

I can extract the full text of the element, including the bold parts, and I can extract the bold elements on their own. What I can't work out is which text comes before and after each bold element, because the same string may also appear elsewhere outside the bold tags, so searching the full text for it is ambiguous.

I'm using lxml, but I'm willing to consider solutions with other parsers, or any clever XPath selectors I don't know about...

But, otherwise, I'm about to resort to regular expressions... which, as we all know, will be the end of the world.

Can someone save the Earth before it's too late?

Upvotes: 0

Views: 78

Answers (1)

hwjp

Reputation: 16071

So I thought this was impossible, but it turns out that if you use the right library, it's not too hard.

With BeautifulSoup 4, you'd use the .children attribute:

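import bs4

html = '''<div>
Bla bla bla <b>bold stuff</b> Bla bla.
But somewhere else the words bold stuff may appear not in bold
</div>'''

# Name the parser explicitly; without it bs4 guesses one and emits a warning.
soup = bs4.BeautifulSoup(html, 'html.parser')

# .children yields the text nodes and child tags of the div, in document order.
print(list(soup.div.children))

# Output on Python 3 (wrapped here for readability):
# ['\nBla bla bla ',
#  <b>bold stuff</b>,
#  ' Bla bla.\nBut somewhere else the words bold stuff may appear not in bold\n']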

And from that it's fairly trivial to achieve what I want.
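For completeness, here is a minimal sketch of that last step, assuming the same snippet and the `html.parser` backend. The list names `bolds` and `normal_text` are just the ones from the question, and whitespace is left exactly as the parser returns it, so you may want to strip it yourself.

```python
import bs4

html = '''<div>
Bla bla bla <b>bold stuff</b> Bla bla.
But somewhere else the words bold stuff may appear not in bold
</div>'''

soup = bs4.BeautifulSoup(html, 'html.parser')

bolds = []
normal_text = []
for child in soup.div.children:
    if isinstance(child, bs4.Tag) and child.name == 'b':
        # A <b> element: keep its text content.
        bolds.append(child.get_text())
    elif isinstance(child, bs4.NavigableString):
        # A bare text node sitting directly inside the div.
        normal_text.append(str(child))

print(bolds)
print(normal_text)
```

Because the children come back in document order, a repeated phrase in the plain text is never confused with the bold one. Tags other than `<b>` are simply skipped in this sketch.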

I'd still be interested if anyone can do it with lxml?
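One possible direction, offered as a sketch rather than a tested lxml answer: the ElementTree data model keeps the text before the first child in `.text` and the text following each child in that child's `.tail`. The example below uses the standard library's `xml.etree.ElementTree` so it runs without extra installs; `lxml.etree` elements expose the same `.text` and `.tail` attributes, so I'd expect the loop to carry over, but that part is an assumption. It also assumes the snippet is well-formed XML.

```python
import xml.etree.ElementTree as ET

markup = '''<div>
Bla bla bla <b>bold stuff</b> Bla bla.
But somewhere else the words bold stuff may appear not in bold
</div>'''

div = ET.fromstring(markup)

bolds = []
normal_text = []

# Text between the opening <div> tag and its first child element.
if div.text:
    normal_text.append(div.text)

for child in div:
    if child.tag == 'b':
        # Join all text inside the <b>, including any nested elements.
        bolds.append(''.join(child.itertext()))
    # Text between this child's closing tag and the next sibling (or </div>).
    if child.tail:
        normal_text.append(child.tail)

print(bolds)
print(normal_text)
```

The text inside child tags other than `<b>` is dropped here; only their tails are kept.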

Upvotes: 1
