Azurespot

Reputation: 3152

Beautiful Soup - get outer tag text without getting inner tag text

I have some <li> tags with nested <a> tags in an HTML file, with text in both the <li> and the <a>. I want to extract the two separately: the <li>'s own text should become the value of the "tag" key, and the text inside the <a> should appear only in the child entry, not in the parent's "tag". (See the HTML snippet below.)

I ultimately write this to a JSON file, but I am getting unwanted results. The main "tag" should contain only "abstract visualization", not the whitespace and nested text that follow it. The child's "tag" should contain only "about", not the "/ Emotive and abstract" after it; "Emotive and abstract" already appears in the title. Every entry in this index sample shows the same pattern. How do I extract the text to the right places? I am a beginner with Beautiful Soup. Thank you.

JSON file

{
    "tag": "abstract visualization\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\nabout / Emotive and abstract\n\n",
    "definition": "",
    "source": [],
    "children": [
        {
            "tag": "about / Emotive and abstract",
            "definition": "",
            "source": [
                {
                    "title": "Emotive and abstract",
                    "href": "https://learning.oreilly.com/library/view/data-visualization-a/9781849693462/ch02s03.html"
                }
            ]
        }
    ]
},
{
    "tag": "Adobe After Effects\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\nURL / Other specialist tools\n\n",
    "definition": "",
    "source": [],
    "children": [
        {
            "tag": "URL / Other specialist tools",
            "definition": "",
            "source": [
                {
                    "title": "Other specialist tools",
                    "href": "https://learning.oreilly.com/library/view/data-visualization-a/9781849693462/ch06.html"
                }
            ]
        }
    ]
},

HTML file snippet:

<ul id="letters">
    <li>abstract visualization
        <ul>
            <li>about / <a href="ch02s03.html" title="Emotive and abstract" class="link">Emotive and abstract</a></li>
        </ul>
    </li>
    <li>Adobe After Effects
        <ul>
            <li>URL / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
        </ul>
    </li>
    <li>Adobe Flash
        <ul>
            <li>about / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
            <li>URL / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
        </ul>
    </li>
    <li>Adobe Illustrator
        <ul>
            <li>about / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
            <li>URL / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
        </ul>
    </li>
</ul>

Relevant code:

# convert html to bs4 object
def bs4_convert(file):
    with open(file, encoding='utf8') as fp:
        html = BeautifulSoup(fp, 'html.parser')
    return html

# create a tag
def li_parser(letter, link_prefix):
    tags = []
    for li in letter.find_all('li', recursive=False):
        tag = {
            'tag': li.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in li.find_all('a', recursive=False)]
        }
        if li.find('ul'):
            tag['children'] = li_parser(li.find('ul'), link_prefix)
        tags.append(tag)

    return tags

# loop through all indices
def html_parser(html, link_prefix):
    tags = []
    # extract index
    html.find(id='backindex')
    # iterate over every indented letter in index
    letters = html.find_all(attrs={'id': 'letters'})
    for letter in letters:
        tags += li_parser(letter, link_prefix)

    return tags

tags = []
# parse the html
html = bs4_convert(course['file'])
# create tags
tags = html_parser(html, link_prefix)
# add course name as outermost tag
tags = add_course_tag(course['course'], tags)

Upvotes: 1

Views: 1055

Answers (2)

HedgeHog

Reputation: 25048

To get the right string for your tag, you can take an approach similar to @diggusbickus's: use stripped_strings, pick the first element, and strip the trailing " /" separator:

'tag': list(li.stripped_strings)[0].strip(' /')

Example

def li_parser(letter, link_prefix):
    tags = []
    for li in letter.find_all('li', recursive=False):
        tag = {
            'tag': list(li.stripped_strings)[0].strip(' /'),
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in li.find_all('a', recursive=False)]
        }
        if li.find('ul'):
            tag['children'] = li_parser(li.find('ul'), link_prefix)
        tags.append(tag)

    return tags

Output

[{"tag": "abstract visualization", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Emotive and abstract", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch02s03.html"}]}]}, {"tag": "Adobe After Effects", "definition": "", "source": [], "children": [{"tag": "URL", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}, {"tag": "Adobe Flash", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Programming environments", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}, {"tag": "URL", "definition": "", "source": [{"title": "Programming environments", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}, {"tag": "Adobe Illustrator", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}, {"tag": "URL", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}]
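To see why picking the first element works, here is a minimal sketch that lists what stripped_strings yields for one entry. The SAMPLE string and the variable names are my own illustration (copied from one entry of the question's HTML), not part of the original code:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one entry copied from the question's index markup.
SAMPLE = """<ul id="letters">
    <li>abstract visualization
        <ul>
            <li>about / <a href="ch02s03.html" title="Emotive and abstract" class="link">Emotive and abstract</a></li>
        </ul>
    </li>
</ul>"""

soup = BeautifulSoup(SAMPLE, "html.parser")
outer = soup.find(id="letters").find("li")  # the top-level <li>
inner = outer.find("ul").find("li")         # the nested <li>

# stripped_strings yields every non-empty text node under the tag,
# whitespace-trimmed and in document order, so the tag's own leading
# text comes first.
outer_strings = list(outer.stripped_strings)
inner_strings = list(inner.stripped_strings)
print(outer_strings)  # ['abstract visualization', 'about /', 'Emotive and abstract']
print(inner_strings)  # ['about /', 'Emotive and abstract']

# Take the first string and strip spaces and the "/" separator.
outer_tag = outer_strings[0].strip(" /")
inner_tag = inner_strings[0].strip(" /")
print(outer_tag)  # abstract visualization
print(inner_tag)  # about
```

One caveat, as far as I can tell: if an <li> begins with a child element instead of plain text, the first string would be that element's text, so this relies on the layout shown in the question.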

Upvotes: 1

folen gateis

Reputation: 2012

The children of a tag can be accessed through a list called contents. In your case, the text you are looking for is simply contents[0], so this is easier than looping through all the children. You just have to remove the unneeded tabs and newlines with strip().

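from bs4 import BeautifulSoup

# data is assumed to hold the HTML from the question as a string
soup = BeautifulSoup(data, 'lxml')
lis = soup.select('#letters > li')
for li in lis:
    # contents[0] is the text node that precedes the nested <ul>
    print(li.contents[0].strip())
    sub_li = li.select_one('ul li')  # first nested <li> only
    print(sub_li.contents[0].strip()[:-2])  # drop the trailing " /"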

This outputs (only the first nested item of each entry is printed):

abstract visualization
about
Adobe After Effects
URL
Adobe Flash
about
Adobe Illustrator
about
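If you need every nested item rather than just the first one, the same contents[0] idea can be applied to each nested <li>. This is only a sketch: own_text is a hypothetical helper name, SAMPLE is one entry copied from the question's HTML, and I use html.parser here so that lxml is not required:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample: the "Adobe Flash" entry from the question's HTML.
SAMPLE = """<ul id="letters">
    <li>Adobe Flash
        <ul>
            <li>about / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
            <li>URL / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
        </ul>
    </li>
</ul>"""


def own_text(li):
    """Return the text that leads a tag, ignoring text inside child tags."""
    first = li.contents[0] if li.contents else ""
    # contents[0] is a Tag, not a string, when the <li> starts with markup.
    if not isinstance(first, NavigableString):
        return ""
    # Trim whitespace, then a trailing "/" separator, then leftover spaces.
    return first.strip().rstrip("/").strip()


soup = BeautifulSoup(SAMPLE, "html.parser")
result = {}
for li in soup.find(id="letters").find_all("li", recursive=False):
    sub_list = li.find("ul")
    # recursive=False keeps only the direct <li> children of the nested <ul>.
    subs = sub_list.find_all("li", recursive=False) if sub_list else []
    result[own_text(li)] = [own_text(sub) for sub in subs]

print(result)  # {'Adobe Flash': ['about', 'URL']}
```

Using rstrip("/") instead of slicing with [:-2] means an item without a trailing slash is left intact.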

Upvotes: 1
