Reputation: 14978
I have the following HTML:
<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b>
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>
I need to separate each heading from its text, for instance, Mother: Diane.
In the end I want a list of dictionaries such as:
[{"label":"Mother","value":"Diane"}]
I tried the code below, but it does not work:
def parse(u):
    u = u.rstrip('\n')
    r = requests.get(u, headers=headers)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        headings = soup.select('table p')
        for h in headings:
            b = h.find('b')
            if b is not None:
                print(b.text)
                print(h.text + '\n')
                print('=================================')

url = 'http://www.nndb.com/people/742/000024670/'
Upvotes: 2
Views: 1646
Reputation: 3225
from bs4 import BeautifulSoup
from urllib.request import urlopen
#html = '''<p>
#<b>Father:</b> Michael Haughton
#<br>
#<b>Mother:</b> Diane
#<br><b>Brother:</b>
#Rashad Haughton<br>
#<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
#<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>'''
page = urlopen('http://www.nndb.com/people/742/000024670/')
source = page.read()
soup = BeautifulSoup(source)
needed_p = soup.find_all('p')[8]
bs = needed_p.find_all('b')
res = {}
for b in bs:
    if b.find_next('a').text:
        res[b.text] = b.find_next('a').text.strip().strip('\n')
    if b.next_sibling != ' ':
        res[b.text] = b.next_sibling.strip().strip('\n')
res
output:
{'Brother:': 'Rashad Haughton',
'Mother:': 'Diane',
'Husband:': 'R. Kelly',
'Father:': 'Michael Haughton',
'Boyfriend:': 'Damon Dash'}
EDIT: For additional info on top of the page:
... (code above) ...
soup = BeautifulSoup(source)
needed_p = soup.find_all('p')[1:4] + [soup.find_all('p')[8]] # here explicitly selecting needed p-tags for further parsing
res = {}
for p in needed_p:
    bs = p.find_all('b')
    for b in bs:
        if b.find_next('a').text:
            res[b.text] = b.find_next('a').text.strip().strip('\n')
        if b.next_sibling != ' ':
            res[b.text] = b.next_sibling.strip().strip('\n')
res
output:
{'Race or Ethnicity:': 'Black',
'Husband:': 'R. Kelly',
'Died:': '25-Aug',
'Nationality:': 'United States',
'Executive summary:': 'R&B singer, died in plane crash',
'Mother:': 'Diane',
'Birthplace:': 'Brooklyn, NY',
'Born:': '16-Jan',
'Boyfriend:': 'Damon Dash',
'Sexual orientation:': 'Straight',
'Occupation:': 'Singer',
'Cause of death:': 'Accident - Airplane',
'Brother:': 'Rashad Haughton',
'Remains:': 'Interred,',
'Gender:': 'Female',
'Father:': 'Michael Haughton',
'Location of death:': 'Marsh Harbour, Abaco Island, Bahamas'}
For this specific page, you can also scrape the high school, for example, this way:
res['High School'] = soup.find_all('p')[9].text.split(':')[1].strip()
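The positional indexes used above (find_all('p')[8], find_all('p')[9]) depend on this page's exact layout. As a rough sketch of an alternative, the paragraph can be located by the text of one of its labels instead. The helper name paragraph_with_label, the SAMPLE markup, and the html.parser choice are assumptions made for illustration; they are not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: an unrelated paragraph
# followed by the paragraph that holds the labels we care about.
SAMPLE = """<p>Intro text</p>
<p><b>Father:</b> Michael Haughton<br>
<b>Mother:</b> Diane<br></p>"""


def paragraph_with_label(soup, label):
    """Return the first <p> containing a <b> whose text equals `label`, else None."""
    for b in soup.find_all("b"):
        if b.get_text(strip=True) == label:
            return b.find_parent("p")
    return None


soup = BeautifulSoup(SAMPLE, "html.parser")
p = paragraph_with_label(soup, "Father:")
labels = [b.get_text(strip=True) for b in p.find_all("b")]
print(labels)
```

The same lookup could replace the hard-coded index before running the loop over the b tags, assuming the label text on the real page matches exactly.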
Upvotes: 1
Reputation: 16720
You're looking for the next_sibling tag attribute.
This gives you either the next NavigableString or the next Tag, depending on what it finds first.
Here is how you can use it:
html = """..."""
soup = BeautifulSoup(html)
bTags = soup.find_all('b')
for it_tag in bTags:
    print(it_tag.string)
    print(it_tag.next_sibling)
Output:
Father:
Michael Haughton
Mother:
Diane
Brother:
Rashad Haughton
Husband:
Boyfriend:
This seems a bit off.
It's partly because of the line breaks and the blanks, which you can get rid of easily with the str.strip method.
Still, the Boyfriend and Husband entries are lacking a value.
It's because next_sibling is either a NavigableString (i.e. a str) or a Tag.
The blank between the <b> tag and the <a> tag here is interpreted as a non-empty text:
<b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a>
                 ^
If it were absent, the next sibling of <b>Boyfriend:</b> would be the <a> tag.
Since it's present, you have to check: if the next sibling is a whitespace-only string, then the information you're looking for is that NavigableString's next sibling, which will be an <a> tag.
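To see the difference concretely, here is a small sketch that compares the next sibling of a <b> tag with and without the blank. The two one-line fragments, the helper name describe_sibling, and the html.parser choice are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Two hypothetical fragments: with and without a blank after </b>.
with_blank = '<p><b>Boyfriend:</b> <a href="#">Damon Dash</a></p>'
without_blank = '<p><b>Boyfriend:</b><a href="#">Damon Dash</a></p>'


def describe_sibling(markup):
    """Return (type name, string form) of the <b> tag's next sibling."""
    b = BeautifulSoup(markup, "html.parser").find("b")
    sibling = b.next_sibling
    return type(sibling).__name__, str(sibling)


print(describe_sibling(with_blank))     # a NavigableString holding one blank
print(describe_sibling(without_blank))  # the <a> Tag itself
```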
Edited code:
import bs4  # needed for the bs4.Tag check below

bTags = soup.find_all('b')
for it_tag in bTags:
    print(it_tag.string)
    nextSibling = it_tag.next_sibling
    if nextSibling is not None:
        if isinstance(nextSibling, str):
            if nextSibling.isspace():
                # Whitespace-only string: the value is in the tag after it.
                print(it_tag.next_sibling.next_sibling.string.strip())
            else:
                print(nextSibling.strip())
        elif isinstance(it_tag.next_sibling, bs4.Tag):
            print(it_tag.next_sibling.string)
Output:
Father:
Michael Haughton
Mother:
Diane
Brother:
Rashad Haughton
Husband:
R. Kelly
Boyfriend:
Damon Dash
Now you can easily build your dictionary:
entries = {}
bTags = soup.find_all('b')
for it_tag in bTags:
    key = it_tag.string.replace(':', '')
    value = None
    nextSibling = it_tag.next_sibling
    if nextSibling is not None:
        if isinstance(nextSibling, str):
            if nextSibling.isspace():
                value = it_tag.next_sibling.next_sibling.string.strip()
            else:
                value = nextSibling.strip()
        elif isinstance(it_tag.next_sibling, bs4.Tag):
            value = it_tag.next_sibling.string
    entries[key] = value
Output dictionary:
{'Father': 'Michael Haughton',
'Mother': 'Diane',
'Brother': 'Rashad Haughton',
'Husband': 'R. Kelly',
'Boyfriend': 'Damon Dash'}
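The question asked for a list of dictionaries with "label" and "value" keys rather than a single dictionary. A minimal sketch of that variant follows, reusing the question's HTML and the same whitespace check. The function name parse_labels and the html.parser choice are assumptions, not part of the code above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

HTML = """<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b>
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>"""


def parse_labels(markup):
    """Hypothetical helper: return [{"label": ..., "value": ...}, ...]."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for b in soup.find_all("b"):
        label = b.get_text(strip=True).rstrip(":")
        node = b.next_sibling
        # Skip a whitespace-only string sitting between </b> and the value.
        if isinstance(node, NavigableString) and not node.strip():
            node = node.next_sibling
        if isinstance(node, Tag):
            value = node.get_text(strip=True)
        elif node is not None:
            value = node.strip()
        else:
            value = None
        rows.append({"label": label, "value": value})
    return rows


for row in parse_labels(HTML):
    print(row)
```

Like the dictionary version, this sketch keeps only the linked name for Husband and Boyfriend and drops the trailing parenthetical text.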
Upvotes: 0