Python Beautiful Soup: How to extract text next to a tag?

Question

I have the following HTML:


Father: Michael Haughton


Mother: Diane

Brother: 
Rashad Haughton

Husband: R. Kelly (m. 1994, annulled that same year)

Boyfriend: Damon Dash (Roc-a-Fella co-CEO)

I need to separate each heading from its text, for instance Mother: Diane.

In the end I want a list of dictionaries such as:

[{"label":"Mother","value":"Diane"}]

I tried the code below, but it does not work:

def parse(u):
    u = u.rstrip('\n')
    r = requests.get(u, headers=headers)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        headings = soup.select('table p')
        for h in headings:
            b = h.find('b')
            if b is not None:
                print(b.text)
                print(h.text + '\n')
                print('=================================')


url = 'http://www.nndb.com/people/742/000024670/'

Dmitriy Fialkovskiy · Accepted Answer

from bs4 import BeautifulSoup
from urllib.request import urlopen

#html = '''
#Father: Michael Haughton
#
#Mother: Diane
#
#Brother:
#Rashad Haughton
#
#Husband: R. Kelly (m. 1994, annulled that same year)
#
#Boyfriend: Damon Dash (Roc-a-Fella co-CEO)
#'''

page = urlopen('http://www.nndb.com/people/742/000024670/')
source = page.read()

soup = BeautifulSoup(source)

needed_p = soup.find_all('p')[8]

bs = needed_p.find_all('b')

res = {}

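for b in bs:
    # Take the text of the next <a> in the document (the value, when it is a link)
    if b.find_next('a').text:
        res[b.text] = b.find_next('a').text.strip().strip('\n')
    # If plain text follows the <b> tag, prefer it over the link text
    if b.next_sibling != ' ':
        res[b.text] = b.next_sibling.strip().strip('\n')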

res

output:

{'Brother:': 'Rashad Haughton',
 'Mother:': 'Diane',
 'Husband:': 'R. Kelly',
 'Father:': 'Michael Haughton',
 'Boyfriend:': 'Damon Dash'}
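
The question asked for a list of label/value dictionaries rather than a single dict. A minimal sketch of one way to get that is shown here. It is not the code from the answer: it walks the siblings after each `<b>` until the next `<br>` or `<b>`, so it also keeps trailing text such as the parenthetical after a link. `SAMPLE_HTML` and `extract_pairs` are hypothetical names, and the markup is an assumption, since the tags in the question's HTML were lost.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumed markup: the original tags are not visible in the question
SAMPLE_HTML = """
<p><b>Father:</b> Michael Haughton<br>
<b>Mother:</b> Diane<br>
<b>Brother:</b> <a href="#">Rashad Haughton</a><br>
<b>Husband:</b> <a href="#">R. Kelly</a> (m. 1994, annulled that same year)<br>
<b>Boyfriend:</b> <a href="#">Damon Dash</a> (Roc-a-Fella co-CEO)</p>
"""


def extract_pairs(paragraph):
    """Return [{'label': ..., 'value': ...}] for each <b> label in a paragraph."""
    pairs = []
    for b in paragraph.find_all('b'):
        parts = []
        # Walk the nodes after the label until the next <br> or <b>
        for node in b.next_siblings:
            if getattr(node, 'name', None) in ('br', 'b'):
                break
            # Plain strings are used as-is; tags such as <a> contribute their text
            parts.append(node if isinstance(node, NavigableString) else node.get_text())
        label = b.get_text(strip=True).rstrip(':')
        value = ' '.join(''.join(parts).split())  # collapse whitespace and newlines
        pairs.append({'label': label, 'value': value})
    return pairs


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
pairs = extract_pairs(soup.find('p'))
print(pairs[1])  # {'label': 'Mother', 'value': 'Diane'}
```

On the real page the paragraph would still have to be selected first, for example with the positional index used above.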

EDIT: To also get the additional info at the top of the page:

... (code above) ...
soup = BeautifulSoup(source)

needed_p = soup.find_all('p')[1:4] + [soup.find_all('p')[8]] # here explicitly selecting needed p-tags for further parsing

res = {}

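for p in needed_p:
    bs = p.find_all('b')
    for b in bs:
        # Take the text of the next <a> in the document (the value, when it is a link)
        if b.find_next('a').text:
            res[b.text] = b.find_next('a').text.strip().strip('\n')
        # If plain text follows the <b> tag, prefer it over the link text
        if b.next_sibling != ' ':
            res[b.text] = b.next_sibling.strip().strip('\n')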

res

output:

{'Race or Ethnicity:': 'Black',
 'Husband:': 'R. Kelly',
 'Died:': '25-Aug',
 'Nationality:': 'United States',
 'Executive summary:': 'R&B singer, died in plane crash',
 'Mother:': 'Diane',
 'Birthplace:': 'Brooklyn, NY',
 'Born:': '16-Jan',
 'Boyfriend:': 'Damon Dash',
 'Sexual orientation:': 'Straight',
 'Occupation:': 'Singer',
 'Cause of death:': 'Accident - Airplane',
 'Brother:': 'Rashad Haughton',
 'Remains:': 'Interred,',
 'Gender:': 'Female',
 'Father:': 'Michael Haughton',
 'Location of death:': 'Marsh Harbour, Abaco Island, Bahamas'}
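
If the list-of-dictionaries shape from the question is needed, a dict like the one above can be converted in one step. This is only a sketch: the `res` below copies three entries from the output so the snippet runs on its own, and `records` is a hypothetical name.

```python
# A few entries copied from the output above
res = {
    'Mother:': 'Diane',
    'Father:': 'Michael Haughton',
    'Birthplace:': 'Brooklyn, NY',
}

# Drop the trailing colon from each label and build one dict per entry
records = [
    {'label': label.rstrip(':'), 'value': value}
    for label, value in res.items()
]
print(records[0])  # {'label': 'Mother', 'value': 'Diane'}
```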

For this specific page, you can also scrape the high school, for example, like this:

res['High School'] = soup.find_all('p')[9].text.split(':')[1].strip()
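
Positional indexes such as `[8]` and `[9]` break as soon as the page gains or loses a paragraph. A possible alternative, sketched here under assumptions, is to locate the paragraph through one of its labels. `SAMPLE_HTML`, `paragraph_with_label`, and the school name are made up for illustration, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Assumed markup with a placeholder school name, for illustration only
SAMPLE_HTML = """
<p><b>Gender:</b> Female</p>
<p><b>Father:</b> Michael Haughton<br><b>Mother:</b> Diane</p>
<p><b>High School:</b> Example High School</p>
"""


def paragraph_with_label(soup, label):
    """Return the <p> containing a <b> whose text is exactly `label`, or None."""
    b = soup.find('b', string=label)
    return b.find_parent('p') if b is not None else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
family_p = paragraph_with_label(soup, 'Father:')
school_p = paragraph_with_label(soup, 'High School:')

# partition splits on the first colon only, so colons inside the value survive
high_school = school_p.get_text().partition(':')[2].strip()
print([b.get_text() for b in family_p.find_all('b')])
print(high_school)
```

Returning `None` for a missing label lets the caller decide how to handle pages that lack a given field.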
