Azurespot

Reputation: 3152

Use find_next_sibling() for specific class value only

I have a bunch of <p> elements in an HTML page that I'm parsing with BeautifulSoup. The page is the index of an online book. I need to create a nested JSON structure where there currently is none, because some index terms are children of a single parent term. So you can think of the index like this:

parent term
    child term
    child term
    child term
parent term
parent term
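
To make the target concrete, here is roughly the JSON shape I'm after, using the same keys my code below builds ('tag', 'definition', 'source', 'children'); the text and hrefs are abbreviated:

[
    {
        "tag": "Action(s):",
        "definition": "",
        "source": [],
        "children": [
            {"tag": "driving, see Driving action", "definition": "", "source": [], "children": []},
            {"tag": "in 4D Framework, 128–132", "definition": "", "source": [{"title": "128", "href": "<link_prefix>c05.xhtml#Page_128"}], "children": []}
        ]
    },
    {
        "tag": "Actionable insights, 51, 132–135",
        "definition": "",
        "source": [{"title": "51", "href": "<link_prefix>c03.xhtml#Page_51"}],
        "children": []
    }
]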

However, the HTML is not nested; everything is a flat list of <p> tags, as shown below. As you can see, the term Action(s) is a parent term with 8 children, while the next parent term, Actionable insights, has 0 children. I have a loop that iterates through each <p> tag and need to nest the children under their parent in the JSON file. So I can't use find_next_siblings() (plural), because it would grab all remaining <p> tags indiscriminately. But if I can find a way to use find_next_sibling() (singular) to get only the tags with 'class': 'index2' and add them to a list, then I can add that list as the children of the parent term. At least, this is my logic so far (a small sketch of the sibling lookups follows the HTML sample below).

<h2>A</h2>
    <p class="index1">Acceptance of insights, merit-based, <a href="c01.xhtml#Page_3">3</a></p>
    <p class="index1">Accuracy of data, <a href="c05.xhtml#Page_125">125</a>, <a href="c05.xhtml#Page_126">126</a></p>
    <p class="index1">Action(s):</p>
    <p class="index2">of audience, “so what?” question about, <a href="c05.xhtml#Page_133">133</a>–135</p>
    <p class="index2">communicating to turn insights into, <a href="c01.xhtml#Page_10">10</a>–12</p>
    <p class="index2">in deriving value from analytics, <a href="c01.xhtml#Page_11">11</a>–12</p>
    <p class="index2">driving, <i>see</i> Driving action</p>
    <p class="index2">empowering audience to act, <a href="c06.xhtml#Page_178">178</a>–180</p>
    <p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>–132</p>
    <p class="index2">inspired by insights, <a href="c01.xhtml#Page_9">9</a>–10</p>
    <p class="index2">as objective of communication, <a href="c02.xhtml#Page_36">36</a>, <a href="c02.xhtml#Page_37">37</a></p>
    <p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a>, <a href="c05.xhtml#Page_132">132</a>–135</p>
    <p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
    <p class="index1">Aggregating data, <a href="c08.xhtml#Page_232">232</a></p>
    <p class="index1">AGT/HEED, <a href="c04.xhtml#Page_108">108</a>–109</p>
    <p class="index1">Aha Moment:</p>
    <p class="index2">connecting Hook and, <a href="c06.xhtml#Page_176">176</a></p>
    <p class="index2">in Data Storytelling Arc, <a href="c06.xhtml#Page_163">163</a>–167</p>
    <p class="index2">in data trailers, <a href="c06.xhtml#Page_181">181</a>, <a href="c06.xhtml#Page_182">182</a>, <a href="c09.xhtml#Page_292">292</a>–293</p>
    <p class="index2">identified in storyboarding, <a href="c06.xhtml#Page_172">172</a>–173</p>
    <p class="index2">initial interest generated by, <a href="c06.xhtml#Page_178">178</a></p>
    <p class="index2">in manufacturing gross margin story, <a href="c09.xhtml#Page_295">295</a></p>
    <p class="index2">in Rosling story, <a href="c09.xhtml#Page_273">273</a></p>
    <p class="index2">in US education system story, <a href="c09.xhtml#Page_286">286</a></p>

The problem, however, is that I can't figure out the logic for it. It's complicated because I need recursion as well, and I keep getting NoneType errors (noted below). The rest of the code works if I take out the code block I'm stuck on. How can I use BeautifulSoup to get only the next <p> tag with a class of index2? At least the children are identified as index2. I just want to avoid scanning the entire document every time I need a few child terms. It seems like it should be straightforward, but I have not had any luck. Thanks for your help.

MY CODE:

from bs4 import BeautifulSoup
import json

# convert html to bs4 object
def bs4_convert(file):
    with open(file, encoding='utf8') as fp:
        html = BeautifulSoup(fp, 'html.parser')
    return html

# create a tag
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        # add all child terms of a parent term to a list

STUCK HERE: this code block keeps failing with "'NoneType' object is not subscriptable" on p.find_next_sibling('p')['class'], even though I check for None first.

        children = []
        if(p.find_next_sibling('p') is not None):
            while(p.find_next_sibling('p')['class'] == ['index2']):
                next_child = p.find_next_sibling('p')
                if(next_child is not None):
                    children.append(next_child)
                    p = next_child
                else:
                    break
                
        # make child tags
        tag['children'] = p_parser(children, link_prefix)

        tags.append(tag)

    return tags

# loop through all indices
def html_parser(html, link_prefix):
    tags = []
    # extract index
    html.find('section', {'role': "doc-index"})
    # iterate over every indented letter in index
    letters = html.find_all('section')
    for letter in letters:
        tags += p_parser(letter.find_all('p'), link_prefix)

    return tags

# add the course name as parent to all tags
def add_course_tag(course_name, tags):
    complete_tags = {
        'tag': course_name,
        'definition': '',
        'source': tags
    }

    return complete_tags

# write tags to JSON file
def write_to_json(course_name, tags):
    # Serializing json 
    json_object = json.dumps(tags, indent = 4)

    # Writing to course_name.json
    with open(course_name + '_tags.json', 'w') as outfile:
        outfile.write(json_object)

def main():
    # course information for the book
    course = {
        'course': 'data_storytelling', # exact course name
        'file': 'data_storytelling.html', # the html file you extracted
        'parse_type': 'index'
    }

    # this link prefix should be the same for all pages of one book
    prefix_id = 'effective-data-storytelling/9781119615712'
    link_prefix = 'https://learning.oreilly.com/library/view/' + prefix_id + '/'

    tags = []
    # parse the html
    html = bs4_convert(course['file'])
    # create tags
    tags = html_parser(html, link_prefix)
    # add course name as outermost tag
    tags = add_course_tag(course['course'], tags)
    # write results to json file
    write_to_json(course['course'], tags)

if __name__ == "__main__":
    main()

EDIT: I tried the code below, but it would not stop running in the command line (and nothing new was written to the JSON file).

# create a tag
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
        }
        # add all child terms of a parent term to a list
        children = []
        for child in p.next_siblings:
            if child.name == 'p' and 'index2' not in child['class']:
                break
            elif child.name == 'p' and 'index2' in child['class']:
                children.append(child) 

        tags.append(tag)
        # make child tags
        tag['children'] = p_parser(children, link_prefix)

    return tags

Upvotes: 0

Views: 349

Answers (1)

HedgeHog

Reputation: 25196

You are close to your goal, just a few small adjustments to make. While iterating over p.next_siblings, check the tag's name as well as its class, and break as soon as you hit a <p> whose class does not contain index2 (next_siblings also yields the whitespace text nodes between the <p> tags, which is why the name check matters). As an aside, the "'NoneType' object is not subscriptable" error in your original loop comes from calling find_next_sibling('p') a second time inside the while condition: your None check only covers the first p, and once p has advanced to the last <p> of a section that repeated call returns None, so subscripting it with ['class'] fails.

children = []

for c in p.next_siblings:
    # the first <p> that is not an index2 child ends this parent's run
    if c.name == 'p' and 'index2' not in c['class']:
        break
    # index2 <p> tags directly following the parent are its children
    elif c.name == 'p' and 'index2' in c['class']:
        children.append(c)

Example

Just to demonstrate; I believe you can adapt it to your code.

import bs4
html='''
<h2>A</h2>
    <p class="index1">Acceptance of insights, merit-based, <a href="c01.xhtml#Page_3">3</a></p>
    <p class="index1">Accuracy of data, <a href="c05.xhtml#Page_125">125</a>, <a href="c05.xhtml#Page_126">126</a></p>
    <p class="index1">Action(s):</p>
    <p class="index2">of audience, “so what?” question about, <a href="c05.xhtml#Page_133">133</a>–135</p>
    <p class="index2">communicating to turn insights into, <a href="c01.xhtml#Page_10">10</a>–12</p>
    <p class="index2">in deriving value from analytics, <a href="c01.xhtml#Page_11">11</a>–12</p>
    <p class="index2">driving, <i>see</i> Driving action</p>
    <p class="index2">empowering audience to act, <a href="c06.xhtml#Page_178">178</a>–180</p>
    <p class="index2">in 4D Framework, <a href="c05.xhtml#Page_128">128</a>–132</p>
    <p class="index2">inspired by insights, <a href="c01.xhtml#Page_9">9</a>–10</p>
    <p class="index2">as objective of communication, <a href="c02.xhtml#Page_36">36</a>, <a href="c02.xhtml#Page_37">37</a></p>
    <p class="index1">Actionable insights, <a href="c03.xhtml#Page_51">51</a>, <a href="c05.xhtml#Page_132">132</a>–135</p>
    <p class="index1">Additive annotations, <a href="c08.xhtml#Page_244">244</a></p>
    <p class="index1">Aggregating data, <a href="c08.xhtml#Page_232">232</a></p>
    <p class="index1">AGT/HEED, <a href="c04.xhtml#Page_108">108</a>–109</p>
    <p class="index1">Aha Moment:</p>
    <p class="index2">connecting Hook and, <a href="c06.xhtml#Page_176">176</a></p>
    <p class="index2">in Data Storytelling Arc, <a href="c06.xhtml#Page_163">163</a>–167</p>
    <p class="index2">in data trailers, <a href="c06.xhtml#Page_181">181</a>, <a href="c06.xhtml#Page_182">182</a>, <a href="c09.xhtml#Page_292">292</a>–293</p>
    <p class="index2">identified in storyboarding, <a href="c06.xhtml#Page_172">172</a>–173</p>
    <p class="index2">initial interest generated by, <a href="c06.xhtml#Page_178">178</a></p>
    <p class="index2">in manufacturing gross margin story, <a href="c09.xhtml#Page_295">295</a></p>
    <p class="index2">in Rosling story, <a href="c09.xhtml#Page_273">273</a></p>
    <p class="index2">in US education system story, <a href="c09.xhtml#Page_286">286</a></p>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')

# this link prefix should be the same for all pages of one book
prefix_id = 'effective-data-storytelling/9781119615712'
link_prefix = 'https://learning.oreilly.com/library/view/' + prefix_id + '/'

data = []

for p in soup.select('p.index1'):
    tag = {
            'tag': p.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)],
            'children':[]
        }
    
    for c in p.next_siblings:
        # the next index1 entry marks the end of this parent's children
        if c.name == 'p' and 'index1' in c['class']:
            break
        # index2 entries between two index1 entries become children
        elif c.name == 'p' and 'index2' in c['class']:
            tag['children'].append({
                'tag': c.text,
                'definition': '',
                'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in c.find_all('a', recursive=False)],
            })
    data.append(tag)
    
print(data)
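
Run against the HTML above, data comes out roughly like this (hrefs shown with the link_prefix variable for brevity, and most entries elided):

[{'tag': 'Acceptance of insights, merit-based, 3',
  'definition': '',
  'source': [{'title': '3', 'href': link_prefix + 'c01.xhtml#Page_3'}],
  'children': []},
 {'tag': 'Action(s):',
  'definition': '',
  'source': [],
  'children': [{'tag': 'of audience, “so what?” question about, 133–135',
                'definition': '',
                'source': [{'title': '133', 'href': link_prefix + 'c05.xhtml#Page_133'}]},
               ...]},
 ...]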

EDIT

#create tag
def create_tag(p, link_prefix):
    tag = {
        'tag': p.text,
        'definition': '',
        'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in p.find_all('a', recursive=False)]
    }
    return tag

#parse p and p children
def p_parser(el, link_prefix):
    tags = []
    for p in el:
        tag = create_tag(p, link_prefix)
        # add all child terms of a parent term to a list
        children = []
        for child in p.next_siblings:
            if child.name == 'p' and 'index2' not in child['class']:
                break
            elif child.name == 'p' and 'index2' in child['class']:
                children.append(create_tag(child, link_prefix))
       
        # make child tags
        if children:
            tag['children'] = children

        # add any parent tags to tags
        tags.append(tag)

    return tags
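
Note that p_parser() now expects el to hold only the parent entries; if you pass it every <p> in a section, the index2 entries will also show up as top-level tags. One way to wire it into your html_parser() (just a sketch of the call, assuming the same class names as in your sample HTML):

# iterate over every lettered section of the index
letters = html.find_all('section')
for letter in letters:
    # pass only the index1 paragraphs as parents; their index2
    # siblings are collected inside p_parser()
    tags += p_parser(letter.find_all('p', class_='index1'), link_prefix)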

Upvotes: 1
