Reputation: 4059
I used page.prettify() to tidy up the HTML, and this is the fragment I want to extract text from:
<div class="item">
<b>
name
</b>
<br/>
stuff here
</div>
My goal is to extract the text "stuff here" from it, but I am stumped because it is not wrapped in any tag of its own, only in the div, which also contains other elements. The extra whitespace in front of every line makes it harder, too.
What would be the way to do this?
Upvotes: 0
Views: 446
Reputation: 50597
You could use the .contents property of the div element to get all the nodes directly within it, then pick out the one that's a string.
Edit:
This was the approach I was alluding to:
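from bs4 import BeautifulSoup
from bs4.element import NavigableString
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup("""<div class='item'> <b> name </b> <br/> stuff here </div>""", "html.parser")
div = soup.find('div')
# Keep only the bare strings directly inside the div, stripped of surrounding whitespace.
print(''.join([el.strip() for el in div.contents if type(el) == NavigableString]))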
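To show how this behaves on the prettified markup from the question (with its line breaks and indentation), here is a minimal sketch. The helper name direct_text is made up for illustration, and the choice to skip HTML comments and join pieces with a space is an assumption, not part of the original answer.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Prettified markup from the question, including the leading whitespace.
html = """
<div class="item">
 <b>
  name
 </b>
 <br/>
 stuff here
</div>
"""


def direct_text(tag):
    """Join the stripped text nodes that are direct children of `tag`.

    Hypothetical helper: text inside child tags such as <b> is ignored.
    """
    pieces = []
    for child in tag.contents:
        # Keep plain strings only; skip child tags and HTML comments.
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            text = child.strip()
            if text:  # drop whitespace-only nodes between tags
                pieces.append(text)
    return " ".join(pieces)


soup = BeautifulSoup(html, "html.parser")
result = direct_text(soup.find("div", class_="item"))
print(result)
```

Because each string is stripped before use, the indentation added by prettify() does not affect the result.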
Upvotes: 0
Reputation: 7054
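A combination of find and next_sibling (spelled nextSibling in older BeautifulSoup code) works for the example that you posted.
from bs4 import BeautifulSoup
soup = BeautifulSoup(""" <div class="item"> <b> name </b> <br/> stuff here </div>""", "html.parser")
print(soup.find("div", "item").find('br').next_sibling.strip())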
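The chained call above raises an AttributeError if the div or the br is missing. A slightly more defensive sketch, assuming the same markup as in the question, might look like this; the function name text_after_br is hypothetical.

```python
from bs4 import BeautifulSoup

# Prettified markup from the question.
html = """
<div class="item">
 <b>
  name
 </b>
 <br/>
 stuff here
</div>
"""


def text_after_br(soup):
    """Return the stripped text node that follows the first <br> in div.item.

    Hypothetical helper: returns None if the div, the <br>, or a
    following text node is not present.
    """
    div = soup.find("div", class_="item")
    br = div.find("br") if div is not None else None
    if br is None:
        return None
    node = br.next_sibling
    # Text nodes are str subclasses; a Tag or None is not.
    return node.strip() if isinstance(node, str) else None


print(text_after_br(BeautifulSoup(html, "html.parser")))
```

This only looks at the node immediately after the br, so it relies on the text following the tag directly, as it does in the posted example.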
Upvotes: 2
Reputation: 1399
If you are really sure that you want to pick up the content starting after a particular tag and ending just before the last one, you can use a regular expression at this point. It is not the most elegant solution, but if your requirements are that specific, it might work.
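As a rough illustration of the regular-expression route, the sketch below assumes the text of interest always sits between a br tag and the closing div tag; that pattern is an assumption based on the markup in the question, and it will break on nested divs or differently structured HTML.

```python
import re

# Prettified markup from the question.
html = """
<div class="item">
 <b>
  name
 </b>
 <br/>
 stuff here
</div>
"""

# Assumed pattern: capture everything between <br>, <br/> or <br /> and the
# next </div>. DOTALL lets "." span the line breaks added by prettify().
pattern = re.compile(r"<br\s*/?>(.*?)</div>", re.DOTALL | re.IGNORECASE)

match = pattern.search(html)
extracted = match.group(1).strip() if match else None
print(extracted)
```

The non-greedy `(.*?)` stops at the first closing div, which is fine for this flat fragment but not for nested markup.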
Upvotes: 1