Reputation: 3152
I have a bunch of <p> elements in an HTML page and am using BeautifulSoup to parse it. The page is an index of an online book. What I need to do is create a nested JSON structure where there is currently none, since some terms of the index are children of a single parent term. You can think of the index like this:
parent term
    child term
    child term
    child term
parent term
parent term
However, the HTML is not nested; every term sits in a flat sequence of <p> tags, as shown below. As you can see, the term Action(s) is a parent term with 8 children, and the next parent term, Actionable insights, has 0 children. I have a loop that iterates through each <p> tag and need to nest the children under the parent in the JSON file. So I can't use find_next_siblings() (plural), because it would just grab all following <p> tags indiscriminately. But if I could use find_next_sibling() (singular) to collect only the tags with 'class': 'index2' into a list, then I could add that list as a child of the parent term. At least, that is my logic so far.
<h2>A</h2>
<p class="index1">Acceptance of insights, merit-based, <a href="c01.xhtml#Page_3">3</a></p>
<p class="index1">Accuracy of data, <a href="c05.xhtml#Page_125">125</a>, <a href="c05.xhtml#Page_126">126</a></p>
<p class="index1">Action(s):</p>
<p class="index2">of audience, “so what?” question about, <a href="c05.xhtml#Page_133">133</a>–135</p>
<p class="index2">communicating to turn insights into, <a href="c01.xhtml#Page_10">10</a>–12</p>
<p class="index2">in deriving value from analytics, <a href="c01.xhtml#Page_11">11</a>–12</p>
<p class="index2">driving, <i>see</i> Driving action</p>
<p class="index2">empowering audience to act, <a href="c06.xhtml#Page_178">178</a>–180</p>
<p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>–132</p>
<p class="index2">inspired by insights, <a href="c01.xhtml#Page_9">9</a>–10</p>
<p class="index2">as objective of communication, <a href="c02.xhtml#Page_36">36</a>, <a href="c02.xhtml#Page_37">37</a></p>
<p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a>, <a href="c05.xhtml#Page_132">132</a>–135</p>
<p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
<p class="index1">Aggregating data, <a href="c08.xhtml#Page_232">232</a></p>
<p class="index1">AGT/HEED, <a href="c04.xhtml#Page_108">108</a>–109</p>
<p class="index1">Aha Moment:</p>
<p class="index2">connecting Hook and, <a href="c06.xhtml#Page_176">176</a></p>
<p class="index2">in Data Storytelling Arc, <a href="c06.xhtml#Page_163">163</a>–167</p>
<p class="index2">in data trailers, <a href="c06.xhtml#Page_181">181</a>, <a href="c06.xhtml#Page_182">182</a>, <a href="c09.xhtml#Page_292">292</a>–293</p>
<p class="index2">identified in storyboarding, <a href="c06.xhtml#Page_172">172</a>–173</p>
<p class="index2">initial interest generated by, <a href="c06.xhtml#Page_178">178</a></p>
<p class="index2">in manufacturing gross margin story, <a href="c09.xhtml#Page_295">295</a></p>
<p class="index2">in Rosling story, <a href="c09.xhtml#Page_273">273</a></p>
<p class="index2">in US education system story, <a href="c09.xhtml#Page_286">286</a></p>
The problem, however, is that I can't figure out the logic for it. It's complicated because I also need recursion. I keep getting NoneType errors (noted below). The rest of the code works if I take out the codeblock I'm stuck on. How can I use BeautifulSoup to get only the next <p> tag with a class of index2? At least the children are consistently identified as index2. I just want to avoid scanning the entire document every time I need a few child terms. It seems like it should be straightforward, but I have not had luck. Thanks for your help.
MY CODE:
from bs4 import BeautifulSoup
import json

# convert html to bs4 object
def bs4_convert(file):
    with open(file, encoding='utf8') as fp:
        html = BeautifulSoup(fp, 'html.parser')
    return html

# create a tag
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        # add all child terms of a parent term to a list
        # STUCK HERE: this codeblock keeps getting snagged on a NoneType error,
        # saying p.find_next_sibling('p')['class'] is not subscriptable,
        # even though I check for None.
        children = []
        if p.find_next_sibling('p') is not None:
            while p.find_next_sibling('p')['class'] == ['index2']:
                next_child = p.find_next_sibling('p')
                if next_child is not None:
                    children.append(next_child)
                    p = next_child
                else:
                    break
        # make child tags
        tag['children'] = p_parser(children, link_prefix)
        tags.append(tag)
    return tags

# loop through all indices
def html_parser(html, link_prefix):
    tags = []
    # extract index
    html.find('section', {'role': "doc-index"})
    # iterate over every indented letter in index
    letters = html.find_all('section')
    for letter in letters:
        tags += p_parser(letter.find_all('p'), link_prefix)
    return tags

# add the course name as parent to all tags
def add_course_tag(course_name, tags):
    complete_tags = {
        'tag': course_name,
        'definition': '',
        'source': tags
    }
    return complete_tags

# write tags to JSON file
def write_to_json(course_name, tags):
    # Serializing json
    json_object = json.dumps(tags, indent=4)
    # Writing to course_name.json
    with open(course_name + '_tags.json', 'w') as outfile:
        outfile.write(json_object)

def main():
    # course information for the book
    course = {
        'course': 'data_storytelling',  # exact course name
        'file': 'data_storytelling.html',  # the html file you extracted
        'parse_type': 'index'
    }
    # this link prefix should be the same for all pages of one book
    prefix_id = 'effective-data-storytelling/9781119615712'
    link_prefix = 'https://learning.oreilly.com/library/view/' + prefix_id + '/'
    tags = []
    # parse the html
    html = bs4_convert(course['file'])
    # create tags
    tags = html_parser(html, link_prefix)
    # add course name as outermost tag
    tags = add_course_tag(course['course'], tags)
    # write results to json file
    write_to_json(course['course'], tags)

if __name__ == "__main__":
    main()
EDIT: I tried this code, but it would not stop running in the command line (and nothing new printed to the JSON file).
# create a tag
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        # add all child terms of a parent term to a list
        children = []
        for child in p.next_siblings:
            if child.name == 'p' and 'index2' not in child['class']:
                break
            elif child.name == 'p' and 'index2' in child['class']:
                children.append(child)
        tags.append(tag)
        # make child tags
        tag['children'] = p_parser(children, link_prefix)
    return tags
Upvotes: 0
Views: 349
Reputation: 25196
You are close to your goal; just a little adjustment is needed. While iterating, check the tag's name as well as its class, and break if it is not a <p> with a class containing index2:
children = []
for c in p.next_siblings:
    if c.name == 'p' and 'index2' not in c['class']:
        break
    elif c.name == 'p' and 'index2' in c['class']:
        children.append(c)
Just to demonstrate; I believe you can adapt it to your code.
import bs4

html = '''
<h2>A</h2>
<p class="index1">Acceptance of insights, merit-based, <a href="c01.xhtml#Page_3">3</a></p>
<p class="index1">Accuracy of data, <a href="c05.xhtml#Page_125">125</a>, <a href="c05.xhtml#Page_126">126</a></p>
<p class="index1">Action(s):</p>
<p class="index2">of audience, “so what?” question about, <a href="c05.xhtml#Page_133">133</a>–135</p>
<p class="index2">communicating to turn insights into, <a href="c01.xhtml#Page_10">10</a>–12</p>
<p class="index2">in deriving value from analytics, <a href="c01.xhtml#Page_11">11</a>–12</p>
<p class="index2">driving, <i>see</i> Driving action</p>
<p class="index2">empowering audience to act, <a href="c06.xhtml#Page_178">178</a>–180</p>
<p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>–132</p>
<p class="index2">inspired by insights, <a href="c01.xhtml#Page_9">9</a>–10</p>
<p class="index2">as objective of communication, <a href="c02.xhtml#Page_36">36</a>, <a href="c02.xhtml#Page_37">37</a></p>
<p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a>, <a href="c05.xhtml#Page_132">132</a>–135</p>
<p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
<p class="index1">Aggregating data, <a href="c08.xhtml#Page_232">232</a></p>
<p class="index1">AGT/HEED, <a href="c04.xhtml#Page_108">108</a>–109</p>
<p class="index1">Aha Moment:</p>
<p class="index2">connecting Hook and, <a href="c06.xhtml#Page_176">176</a></p>
<p class="index2">in Data Storytelling Arc, <a href="c06.xhtml#Page_163">163</a>–167</p>
<p class="index2">in data trailers, <a href="c06.xhtml#Page_181">181</a>, <a href="c06.xhtml#Page_182">182</a>, <a href="c09.xhtml#Page_292">292</a>–293</p>
<p class="index2">identified in storyboarding, <a href="c06.xhtml#Page_172">172</a>–173</p>
<p class="index2">initial interest generated by, <a href="c06.xhtml#Page_178">178</a></p>
<p class="index2">in manufacturing gross margin story, <a href="c09.xhtml#Page_295">295</a></p>
<p class="index2">in Rosling story, <a href="c09.xhtml#Page_273">273</a></p>
<p class="index2">in US education system story, <a href="c09.xhtml#Page_286">286</a></p>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')

# this link prefix should be the same for all pages of one book
prefix_id = 'effective-data-storytelling/9781119615712'
link_prefix = 'https://learning.oreilly.com/library/view/' + prefix_id + '/'

data = []
for p in soup.select('p.index1'):
    tag = {
        'tag': p.text,
        'definition': '',
        'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)],
        'children': []
    }
    for c in p.next_siblings:
        if c.name == 'p' and 'index1' in c['class']:
            break
        elif c.name == 'p' and 'index2' in c['class']:
            tag['children'].append({
                'tag': c.text,
                'definition': '',
                'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in c.find_all('a', recursive=False)],
            })
    data.append(tag)

data
# create tag
def create_tag(p, link_prefix):
    tag = {
        'tag': p.text,
        'definition': '',
        'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
    }
    return tag

# parse p and p children
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = create_tag(p, link_prefix)
        # add all child terms of a parent term to a list
        children = []
        for child in p.next_siblings:
            if child.name == 'p' and 'index2' not in child['class']:
                break
            elif child.name == 'p' and 'index2' in child['class']:
                children.append(create_tag(child, link_prefix))
        # make child tags
        if children:
            tag['children'] = children
        # add any parent tags to tags
        tags.append(tag)
    return tags
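Condensed to a self-contained run, the sibling-walk approach produces the nested structure directly and without recursion. The sample HTML below is trimmed from the question's index; the dictionary keys are simplified to just 'tag' and 'children' for brevity.

```python
from bs4 import BeautifulSoup

html = '''
<p class="index1">Action(s):</p>
<p class="index2">driving action</p>
<p class="index2">empowering audience</p>
<p class="index1">Actionable insights</p>
'''

soup = BeautifulSoup(html, 'html.parser')

def p_parser(paragraphs):
    tags = []
    for p in paragraphs:
        tag = {'tag': p.text, 'children': []}
        # collect index2 siblings until the next non-index2 <p>
        for sib in p.next_siblings:
            if sib.name != 'p':
                continue  # skip whitespace text nodes between tags
            if 'index2' not in sib['class']:
                break
            tag['children'].append({'tag': sib.text})
        tags.append(tag)
    return tags

result = p_parser(soup.select('p.index1'))
# Action(s): gets 2 children; Actionable insights gets none
```

Because each parent only walks forward until it hits the next index1 entry, the whole document is never rescanned, which was the original concern.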
Upvotes: 1