j_allen_morris

Reputation: 599

Checking children for tags in Beautiful Soup 4 with Python

I am using Beautiful Soup 4 with Python to parse some HTML. Here's the code:

from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'

soup = bs(html_doc, 'html.parser')
para = soup.p

for child in soup.p.children:
    print(child)

The result is:

IN
<i>THE </i>
<b>DISTRICT</b>
 COURT OF {county} COUNTY
STATE OF OKLAHOMA

This all makes sense. What I'm trying to do is iterate through the children and, whenever I find an <i> or <b> tag, handle it differently from the plain text. When I try the following, it doesn't work:

for child in soup.p.children:
    if child.findChildren('i'):
        print('italics found')

It fails because the first child returned is a string, not a tag. I'm trying to search it for a child tag, but a string has no children to search.

So I changed the code to check whether the child is a string and, if so, just print it without trying to search it.

for child in soup.p.children:
    if isinstance(child, str):
        print(child)
    elif child.findAll('i'):
        for tag in child.findAll('i'):
            print(tag)

The result of this latest code:

IN
 COURT OF {county} COUNTY
STATE OF OKLAHOMA

As I loop through the results, I need to be able to check for tags in the result, but I can't seem to figure out how. I thought it should be simple, but I'm stumped.

EDIT:

In response to jacalvo:

If I run

for child in soup.p.children:
    if child.find('i'):
        print(child)

It still fails to print the 2nd and 3rd lines of the original output (the <i> and <b> tags).

Edit:

for child in soup.p.children:
    if isinstance(child, str):
        print(child)
    else:
        print(child.findChildren('i', recursive=False))

This resulted in:

IN
[]
[]
 COURT OF {county} COUNTY
STATE OF OKLAHOMA

Upvotes: 1

Views: 3328

Answers (3)

jslipknot

Reputation: 435

    from bs4 import BeautifulSoup as bs

    html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} ' \
               'COUNTY\nSTATE OF OKLAHOMA</p> '

    soup = bs(html_doc, 'html.parser')
    paragraph = soup.p

    # names of every tag in the document, collected dynamically
    tags = [tag.name for tag in soup.find_all()]

    for child in paragraph.children:
        if child.name in tags:
            print('{0}'.format(child))  # or child.text
        else:
            print(child)

Output

    IN 
    <i>THE </i>
    <b>DISTRICT</b>
     COURT OF {county} COUNTY
    STATE OF OKLAHOMA
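
As a complement to the name lookup above, here is a minimal sketch that tells tags and strings apart by type instead. It assumes the same html_doc as the question; classify_children is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html_doc = ('<p class="line-spacing-double" align="center">IN <i>THE </i>'
            '<b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>')
soup = BeautifulSoup(html_doc, 'html.parser')


def classify_children(parent):
    """Return (kind, text) pairs for the direct children of `parent`.

    `kind` is the tag name for Tag children and 'text' for strings.
    (Hypothetical helper, written only for this illustration.)
    """
    result = []
    for child in parent.children:
        if isinstance(child, Tag):
            # The child itself is the tag, so inspect it directly
            # instead of searching inside it.
            result.append((child.name, child.get_text()))
        elif isinstance(child, NavigableString):
            result.append(('text', str(child)))
    return result


for kind, text in classify_children(soup.p):
    print(kind, repr(text))
```

The point of the sketch is that each <i> or <b> element is itself one of the children, so searching inside it for another <i> finds nothing. That would explain the empty lists in the question's last attempt.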

Upvotes: 0

Mark Tolonen

Reputation: 177725

Is this an example of what you mean by "do something different" with the tags? Including a sample of the full desired output in the question would help:

from bs4 import BeautifulSoup as bs

html_doc = '<p class="line-spacing-double" align="center">IN <i>THE</i> <b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'
soup = bs(html_doc, 'html.parser')
para = soup.p

for child in para.children:
    if child.name == 'i':
        print(f'*{child.text}*', end='')
    elif child.name == 'b':
        print(f'**{child.text}**', end='')
    else:
        print(child, end='')

Output:

IN *THE* **DISTRICT** COURT OF {county} COUNTY
STATE OF OKLAHOMA
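
If more tags need special handling, the if/elif chain could be replaced by a lookup table. The following is only a sketch under that assumption: FORMATTERS and render are hypothetical names introduced here, and getattr is used defensively in case a child has no name attribute.

```python
from bs4 import BeautifulSoup

html_doc = ('<p class="line-spacing-double" align="center">IN <i>THE</i> '
            '<b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>')
soup = BeautifulSoup(html_doc, 'html.parser')

# Hypothetical mapping from tag name to a function that formats its text.
FORMATTERS = {
    'i': lambda text: f'*{text}*',
    'b': lambda text: f'**{text}**',
}


def render(parent):
    """Join the direct children of `parent`, formatting known tags."""
    parts = []
    for child in parent.children:
        # Strings have no tag name, so the lookup falls through to the default.
        formatter = FORMATTERS.get(getattr(child, 'name', None))
        if formatter is not None:
            parts.append(formatter(child.get_text()))
        else:
            parts.append(str(child))
    return ''.join(parts)


print(render(soup.p))
```

Supporting another tag then only means adding one entry to the dictionary; children without an entry are emitted unchanged.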

Upvotes: 1

KunduK

Reputation: 33384

Use findChildren(recursive=False) to get the direct child tags, then check each child's name. Note that this returns only tags, so the text between them is skipped.

from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'

soup = bs(html_doc, 'html.parser')

for child in soup.find('p').findChildren(recursive=False):
    if child.name == 'i':
        print(child)
    if child.name == 'b':
        print(child)

Output:

<i>THE </i>
<b>DISTRICT</b>
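
The name checks can also be folded into the search itself, since find_all accepts a list of tag names. A minimal sketch, assuming the same html_doc and that only direct <i> and <b> children matter:

```python
from bs4 import BeautifulSoup

html_doc = ('<p class="line-spacing-double" align="center">IN <i>THE </i>'
            '<b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>')
soup = BeautifulSoup(html_doc, 'html.parser')

# Match only <i> and <b> tags that are direct children of the <p>.
direct_tags = soup.p.find_all(['i', 'b'], recursive=False)
names = [tag.name for tag in direct_tags]

for tag in direct_tags:
    print(tag)
```

Like the loop above, this yields tags only, so it fits when the surrounding text is not needed.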

Upvotes: 0
