problems ... BeautifulSoup Parsing

Question

BACKGROUND
Mr. Paul J. Fribourg has bla bla

    Read Full Background

I would like to extract the text from "Mr. Paul" to "bla bla". Some webpages have a <p> tag in front of "Mr. Paul", so I can use findNext('p'). However, some webpages do not have a <p> tag, like the example above.

This is my code for when there is a <p> tag:

background = bs2.find(text=re.compile("BACKGROUND"))
bb= background.findNext('p').contents

But when there is no <p> tag, how can I extract the information?

johnsyweb · Accepted Answer

It's hard to tell from the example you have given us, but it looks to me as though you could just get the next node after the h2. In this example, Lewis Carroll has a paragraph (<p>) tag and your friend Paul has only a closing span tag:

>>> from BeautifulSoup import BeautifulSoup
>>>
>>> html = '''
... BACKGROUND
... Mr. Lewis Carroll has bla bla
... 
...     Read Full Background
... 
... BACKGROUND
... Mr. Paul J. Fribourg has bla bla
... 
...     Read Full Background
... 
... '''
>>>
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     p = section.findNext('p')
...     if p:
...         print '> ',  p.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
>  Mr. Lewis Carroll has bla bla
>  Mr. Paul J. Fribourg has bla bla
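
For readers on current tooling, here is a minimal sketch of the same idea using bs4 (BeautifulSoup 4) and Python 3 rather than the BeautifulSoup 3 / Python 2 session above. The markup in HTML is a hypothetical stand-in for the two page layouts (text inside a <p>, and bare text after the heading), and background_text is an illustrative helper name, not part of the library. Instead of findNext('p'), which searches the rest of the document, it walks only the heading's following siblings and stops at the next h2.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: the first section wraps its text in <p>,
# the second leaves the text bare after the heading.
HTML = """
<div>
<h2>BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<a href="#">Read Full Background</a>
</div>
<div>
<h2>BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla
<a href="#">Read Full Background</a>
</div>
"""


def background_text(heading):
    """Return the first non-empty text among the siblings that follow `heading`."""
    for node in heading.next_siblings:
        if isinstance(node, NavigableString):
            text = node.strip()                 # bare text after the heading
        elif node.name == "h2":
            break                               # reached the next section
        elif node.name == "p":
            text = node.get_text(strip=True)    # text wrapped in a paragraph
        else:
            continue                            # skip unrelated tags such as links
        if text:
            return text
    return None


soup = BeautifulSoup(HTML, "html.parser")
results = [background_text(h2) for h2 in soup.find_all("h2", string="BACKGROUND")]
for text in results:
    print(">", text)
```

Restricting the search to siblings is a design choice: it avoids picking up a <p> that belongs to some later, unrelated part of the page when the current section has none.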

Following the comments, here is the same approach run against the live page:

>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP')
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     paragraph = section.findNext('p')
...     if paragraph and paragraph.string:
...         print '> ', paragraph.string
...     else:
...         print '> ', section.parent.next.next.strip()
... 
>  Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]
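
The extra paragraph.string check in the session above matters because .string is None whenever a tag holds more than one child node. A small sketch of that behaviour, assuming bs4 (BeautifulSoup 4) on Python 3 and a made-up paragraph with nested markup:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph whose text is split by a nested <b> tag.
soup = BeautifulSoup("<p>Mr. Paul has <b>bla</b> bla</p>", "html.parser")
p = soup.find("p")

only_string = p.string      # None, because <p> has three children
full_text = p.get_text()    # concatenates the text of all descendants

print(only_string)
print(full_text)
```

When the paragraph may contain nested tags, get_text() is one way to keep the text instead of falling through to the else branch.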

You may, of course, wish to check copyright notices, et cetera...
