Volatil3

Reputation: 14978

Python Beautiful Soup: How to extract text next to a tag?

I have the following HTML:

<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b> 
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>

I need to separate each heading from its text, for instance Mother: Diane.

In the end I would like to have a list of dictionaries such as:

[{"label":"Mother","value":"Diane"}]

I tried the code below, but it does not work:

def parse(u):
    u = u.rstrip('\n')
    r = requests.get(u, headers=headers)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        headings = soup.select('table p')
        for h in headings:
            b = h.find('b')
            if b is not None:
                print(b.text)
                print(h.text + '\n')
                print('=================================')


url = 'http://www.nndb.com/people/742/000024670/'

Upvotes: 2

Views: 1646

Answers (2)

Dmitriy Fialkovskiy

Reputation: 3225

from bs4 import BeautifulSoup
from urllib.request import urlopen

#html = '''<p>
#<b>Father:</b> Michael Haughton
#<br>
#<b>Mother:</b> Diane
#<br><b>Brother:</b> 
#Rashad Haughton<br>
#<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
#<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>'''

page = urlopen('http://www.nndb.com/people/742/000024670/')
source = page.read()

soup = BeautifulSoup(source, 'lxml')  # name the parser explicitly; 'lxml' is the one the question uses

needed_p = soup.find_all('p')[8]

bs = needed_p.find_all('b')

res = {}

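for b in bs:
    sibling = b.next_sibling
    if isinstance(sibling, str) and sibling.strip():
        # plain text directly after the <b> label
        res[b.text] = sibling.strip()
    else:
        # only whitespace follows, so take the next <a> tag's text
        link = b.find_next('a')
        if link is not None:
            res[b.text] = link.text.strip()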

res

output:

{'Brother:': 'Rashad Haughton',
 'Mother:': 'Diane',
 'Husband:': 'R. Kelly',
 'Father:': 'Michael Haughton',
 'Boyfriend:': 'Damon Dash'}
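
A note on find_all('p')[8]: that index depends on how the page happens to be laid out, so it can silently point at the wrong paragraph if the markup changes. A minimal sketch of a less position-dependent lookup, assuming the wanted paragraph always contains a Father: label. The sample markup and variable names below are made up for illustration and are not taken from the real page:

```python
from bs4 import BeautifulSoup

# Made-up two-paragraph sample; only the second paragraph holds the labels.
sample = """<p>Unrelated paragraph</p>
<p><b>Father:</b> Michael Haughton<br><b>Mother:</b> Diane<br></p>"""

soup = BeautifulSoup(sample, "html.parser")

# Find the <b> whose text is exactly "Father:" and step up to its <p>.
label = soup.find("b", string="Father:")
needed_p = label.parent if label is not None else None

labels = [b.text for b in needed_p.find_all("b")] if needed_p is not None else []
print(labels)
```

The rest of the loop can then run on needed_p unchanged.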

EDIT: To also collect the additional info at the top of the page:

... (code above) ...
soup = BeautifulSoup(source, 'lxml')

all_p = soup.find_all('p')
needed_p = all_p[1:4] + [all_p[8]]  # explicitly select the p-tags to parse

res = {}

for p in needed_p:
    for b in p.find_all('b'):
        sibling = b.next_sibling
        if isinstance(sibling, str) and sibling.strip():
            # plain text directly after the <b> label
            res[b.text] = sibling.strip()
        else:
            # only whitespace follows, so take the next <a> tag's text
            link = b.find_next('a')
            if link is not None:
                res[b.text] = link.text.strip()

res

output:

{'Race or Ethnicity:': 'Black',
 'Husband:': 'R. Kelly',
 'Died:': '25-Aug',
 'Nationality:': 'United States',
 'Executive summary:': 'R&B singer, died in plane crash',
 'Mother:': 'Diane',
 'Birthplace:': 'Brooklyn, NY',
 'Born:': '16-Jan',
 'Boyfriend:': 'Damon Dash',
 'Sexual orientation:': 'Straight',
 'Occupation:': 'Singer',
 'Cause of death:': 'Accident - Airplane',
 'Brother:': 'Rashad Haughton',
 'Remains:': 'Interred,',
 'Gender:': 'Female',
 'Father:': 'Michael Haughton',
 'Location of death:': 'Marsh Harbour, Abaco Island, Bahamas'}

For this particular page, you can also scrape the high school, for example, like this:

res['High School'] = soup.find_all('p')[9].text.split(':')[1].strip()

Upvotes: 1

Right leg

Reputation: 16720

You're looking for the next_sibling attribute of a tag. It returns whichever node comes next at the same level of the tree: either a NavigableString or a Tag.

Here is how you can use it:

from bs4 import BeautifulSoup

html = """..."""
soup = BeautifulSoup(html, 'html.parser')

bTags = soup.find_all('b')
for it_tag in bTags:
    print(it_tag.string)
    print(it_tag.next_sibling)

Output:

Father:
 Michael Haughton

Mother:
 Diane

Brother:

Rashad Haughton
Husband:

Boyfriend:

This seems a bit off. It's partly because of the line breaks and the blanks, which you can get rid of easily with the str.strip method.
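
As a quick illustration of the stripping step, here is a minimal sketch on a shortened copy of the question's markup (the variable names snippet and cleaned are mine):

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
snippet = "<p><b>Father:</b> Michael Haughton\n<br><b>Mother:</b> Diane\n<br></p>"
soup = BeautifulSoup(snippet, "html.parser")

# Each next_sibling here is a NavigableString, which subclasses str,
# so str.strip() removes the surrounding blanks and line breaks.
cleaned = [b.next_sibling.strip() for b in soup.find_all("b")]
print(cleaned)
```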

Still, the Boyfriend and Husband entries are lacking a value. That is because next_sibling returns either a NavigableString (i.e. a str) or a Tag, whichever comes first. Here, the blank between the <b> tag and the <a> tag counts as a text node of its own, so it is what next_sibling returns:

<b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a>
                 ^

If it were absent, <b>Boyfriend:</b>'s next sibling would be the <a> tag. Since it's present, you have to check:

  • Whether the next sibling is a string or a tag;
  • If it is a string, whether it contains only whitespace.

If the next sibling is a whitespace-only string, then the information you're looking for is that NavigableString's own next sibling, which will be an <a> tag.
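
To see the difference the blank makes, here is a small sketch that parses the Boyfriend entry twice, once with and once without the space (the variable names are only for illustration):

```python
from bs4 import BeautifulSoup, NavigableString, Tag

with_space = BeautifulSoup(
    '<p><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a></p>',
    "html.parser",
)
without_space = BeautifulSoup(
    '<p><b>Boyfriend:</b><a href="/people/420/000109093/">Damon Dash</a></p>',
    "html.parser",
)

sib_with = with_space.b.next_sibling        # the whitespace-only text node
sib_without = without_space.b.next_sibling  # the <a> tag itself

print(type(sib_with).__name__, repr(str(sib_with)))
print(type(sib_without).__name__, sib_without.name)

# With the blank present, the <a> tag is one sibling further along.
print(sib_with.next_sibling.string)
```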

Edited code:

import bs4  # needed for the bs4.Tag check below

bTags = soup.find_all('b')

for it_tag in bTags:
    print(it_tag.string)

    nextSibling = it_tag.next_sibling
    if nextSibling is not None:
        if isinstance(nextSibling, str):
            if nextSibling.isspace():
                # whitespace only: the value sits in the tag that follows
                print(nextSibling.next_sibling.string.strip())
            else:
                print(nextSibling.strip())

        elif isinstance(nextSibling, bs4.Tag):
            print(nextSibling.string)

Output:

Father:
Michael Haughton
Mother:
Diane
Brother:
Rashad Haughton
Husband:
R. Kelly
Boyfriend:
Damon Dash

Now you can easily build your dictionary:

entries = {}
bTags = soup.find_all('b')

for it_tag in bTags:
    key = it_tag.string.replace(':', '')
    value = None

    nextSibling = it_tag.next_sibling
    if nextSibling is not None:
        if isinstance(nextSibling, str):
            if nextSibling.isspace():
                # whitespace only: the value sits in the tag that follows
                value = nextSibling.next_sibling.string.strip()
            else:
                value = nextSibling.strip()

        elif isinstance(nextSibling, bs4.Tag):
            value = nextSibling.string

    entries[key] = value

Output dictionary:

{'Father': 'Michael Haughton',
 'Mother': 'Diane',
 'Brother': 'Rashad Haughton',
 'Husband': 'R. Kelly',
 'Boyfriend': 'Damon Dash'}
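
The question asks for a list of dictionaries with label and value keys rather than a single dictionary. Below is a self-contained sketch of one way to get that shape. The helper name extract_pairs is hypothetical, and stopping at the <br> that ends each entry is my own choice. Like the code above, it keeps only the first piece of text after each label, so the parenthetical notes after the links are dropped:

```python
from bs4 import BeautifulSoup, Tag

# Markup copied from the question.
html = """<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b>
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>"""


def extract_pairs(markup):
    """Hypothetical helper: return [{'label': ..., 'value': ...}, ...]."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for b in soup.find_all("b"):
        value = None
        # Walk the nodes after <b> until something non-blank turns up
        # or the <br> that ends this entry is reached.
        for node in b.next_siblings:
            if isinstance(node, Tag):
                if node.name == "br":
                    break
                text = node.get_text(strip=True)
            else:
                text = node.strip()
            if text:
                value = text
                break
        pairs.append({"label": b.get_text(strip=True).rstrip(":"), "value": value})
    return pairs


records = extract_pairs(html)
for record in records:
    print(record)
```

A label with nothing after it before the next <br> ends up with value None instead of raising an error.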

Upvotes: 0
