Willy

Reputation: 15

problems ... BeautifulSoup Parsing

<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>

I would like to extract the text from "Mr. Paul" to "bla bla". Some webpages have a <p> in front of "Mr. Paul", so I can use findNext('p'). However, some webpages do not have a <p>, as in the example above.

This is my code for the case where there is a <p>:

import re
background = bs2.find(text=re.compile("BACKGROUND"))
bb = background.findNext('p').contents

But when there is no <p>, how can I extract the information?

Upvotes: 0

Views: 848

Answers (2)

johnsyweb

Reputation: 141770

It's hard to tell from the example you have given us, but it looks to me as though you could just get the next node after the h2. In this example, Lewis Carroll has a paragraph tag and your friend Paul has only a closing span tag:

>>> from BeautifulSoup import BeautifulSoup
>>>
>>> html = '''
... <h2 class="sectionTitle">BACKGROUND</h2>
... <p>Mr. Lewis Carroll has bla bla</p>
... <div style="margin-top:8px;">
...     <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... <h2 class="sectionTitle">BACKGROUND</h2>
... Mr. Paul J. Fribourg has bla bla</span>
... <div style="margin-top:8px;">
...     <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... '''
>>>
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     p = section.findNext('p')
...     if p:
...         print '> ',  p.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
>  Mr. Lewis Carroll has bla bla
>  Mr. Paul J. Fribourg has bla bla

Following up on the comments:

>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP')
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     paragraph = section.findNext('p')
...     if paragraph and paragraph.string:
...         print '> ', paragraph.string
...     else:
...         print '> ', section.parent.next.next.strip()
... 
>  Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]

You may, of course, wish to check copyright notices, et cetera...

Upvotes: 2

smci

Reputation: 33940

"Some webpage has <p> infront of Mr. Paul, so I could use FindNext('p') However, some webpages do not have <p> like the example above."

You haven't given us enough information to say how your string can be recognized. Some options:

  • a fixed node structure, e.g. getChildren()[1].getChildren()[0].text
  • if it's always preceded by the magic string 'BACKGROUND', as in your code, then your approach of finding the next node seems good; just don't build in the assumption that the tag name is 'p'
  • a regex (e.g. "(Mr.|Ms.) ...")

Can you show us an HTML example where there is no <p> in front of the name?
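
One way to avoid assuming the tag name is 'p' is to take the first non-blank text that follows the BACKGROUND heading, whatever wraps it. Here is a minimal sketch of that idea. It uses Python 3's standard-library html.parser rather than BeautifulSoup so that it runs without installing anything; the names BackgroundExtractor, extract_backgrounds and sample are made up for illustration, and the sample markup is adapted from the question.

```python
from html.parser import HTMLParser


class BackgroundExtractor(HTMLParser):
    """Hypothetical helper: collect the first non-blank text after each
    <h2> heading whose text is exactly BACKGROUND."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False     # currently inside an <h2> element
        self.waiting = False   # saw the heading, still waiting for its text
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        # Stray end tags such as </span> also arrive here and are ignored.
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_h2:
            # A new heading: arm only if it is the BACKGROUND heading.
            self.waiting = (text == "BACKGROUND")
        elif self.waiting:
            # First real text after the heading, with or without a <p>.
            self.results.append(text)
            self.waiting = False


def extract_backgrounds(html):
    parser = BackgroundExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


# Sample markup adapted from the question: one section with <p>, one without.
sample = """
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<div style="margin-top:8px;"><a href="javascript:void(0)">Read Full Background</a></div>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;"><a href="javascript:void(0)">Read Full Background</a></div>
"""

print(extract_backgrounds(sample))
```

On the sample this picks up both the Lewis Carroll text and the Paul J. Fribourg text. It only grabs the first text node after the heading, so a background that is split across several nodes would need more work.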

Upvotes: 0
