Markum

Reputation: 4059

Getting specific data with BeautifulSoup

I used page.prettify() to tidy up the HTML, and this is the markup I now want to extract text from:

        <div class="item">
         <b>
          name
         </b>
         <br/>
         stuff here
        </div>

My goal is to extract the text "stuff here", but I am stumped because it is not wrapped in any tag except the div, which also contains other content. The extra whitespace at the start of every line makes it harder still.

How can I do this?

Upvotes: 0

Views: 446

Answers (3)

Acorn

Reputation: 50597

You could use the .contents property of the div element to get all of its direct children, then pick out the ones that are strings.

Edit:

This was the approach I was alluding to:

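from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup("""<div class='item'> <b> name </b>  <br/>  stuff here </div>""", "html.parser")
div = soup.find('div')
# Keep only the bare strings directly inside the div, stripped of whitespace
print(''.join(el.strip() for el in div.contents if type(el) is NavigableString))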
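Applied to the prettified markup from the question, the same idea might look like the sketch below. The helper name direct_text is hypothetical, and the HTML constant is just the snippet from the question. Stripping each string takes care of the leading whitespace that prettify() adds.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# The prettified snippet from the question, indentation included
HTML = """
        <div class="item">
         <b>
          name
         </b>
         <br/>
         stuff here
        </div>
"""


def direct_text(tag):
    """Join the non-empty text nodes that are direct children of tag."""
    # Nested tags such as <b> are skipped, so "name" is not included
    pieces = (
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString)
    )
    return " ".join(piece for piece in pieces if piece)


soup = BeautifulSoup(HTML, "html.parser")
result = direct_text(soup.find("div", class_="item"))
print(result)
```

Note that isinstance also matches NavigableString subclasses such as comments, so a div containing HTML comments would need an extra filter.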

Upvotes: 0

ditkin

Reputation: 7054

A combination of find and next_sibling (spelled nextSibling in BeautifulSoup 3) works for the example that you posted.

from bs4 import BeautifulSoup

soup = BeautifulSoup(""" <div class="item"> <b> name </b>  <br/>  stuff here </div>""", "html.parser")
print(soup.find("div", "item").find("br").next_sibling.strip())

Upvotes: 2

subiet

Reputation: 1399

If you are really sure that you want the content that starts after a particular tag and ends just before the last tag, you can use a regular expression from this point. It is not the most elegant approach, but if your requirements are specific, it might work.
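A minimal sketch of the regular-expression route, assuming the wanted text always sits between a br tag and the closing div tag, as it does in the question's sample. The pattern and the HTML constant are illustrative only, and a pattern like this is brittle against any other markup.

```python
import re

# The prettified snippet from the question, indentation included
HTML = """
        <div class="item">
         <b>
          name
         </b>
         <br/>
         stuff here
        </div>
"""

# Capture everything between the <br/> tag and the closing </div>.
# re.DOTALL lets "." match the newlines that prettify() introduces.
pattern = re.compile(r"<br\s*/?>(.*?)</div>", re.DOTALL)

match = pattern.search(HTML)
# strip() removes the indentation and line breaks around the text
text = match.group(1).strip() if match else None
print(text)
```

If the div could contain nested tags after the br, the captured group would include them, so a parser-based approach is usually the safer choice.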

Upvotes: 1
