Reputation: 599
I am using BeautifulSoup 4 with Python to parse some HTML. Here's the code:
from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'
soup = bs(html_doc, 'html.parser')
para = soup.p
for child in soup.p.children:
    print(child)
The result is:
IN
<i>THE </i>
<b>DISTRICT</b>
COURT OF {county} COUNTY
STATE OF OKLAHOMA
This all makes sense. What I'm trying to do is iterate through the results and, if I find an <i> or <b>, do something different with it. When I try the following, it doesn't work:
for child in soup.p.children:
    if child.findChildren('i'):
        print('italics found')
The error occurs because the first child returned is a string, and I'm trying to search it for a child tag when BS4 already knows a string has no children.
So I changed the code to check whether the child is a string and, if so, just print it without taking any other action.
for child in soup.p.children:
    if isinstance(child, str):
        print(child)
    elif child.findAll('i'):
        for tag in child.findAll('i'):
            print(tag)
The result of this latest code:
IN
COURT OF {county} COUNTY
STATE OF OKLAHOMA
As I loop through the results, I need to be able to check for tags in the result, but I can't seem to figure out how. I thought it should be simple, but I'm stumped.
EDIT:
In response to jacalvo:
If I run
for child in soup.p.children:
    if child.find('i'):
        print(child)
It still fails to print out the 2nd and 3rd lines from the HTML code
Edit:
for child in soup.p.children:
    if isinstance(child, str):
        print(child)
    else:
        print(child.findChildren('i', recursive=False))
This resulted in:
IN
[]
[]
COURT OF {county} COUNTY
STATE OF OKLAHOMA
Upvotes: 1
Views: 3328
Reputation: 435
from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} ' \
'COUNTY\nSTATE OF OKLAHOMA</p> '
soup = bs(html_doc, 'html.parser')
paragraph = soup.p
# all tags dynamically gotten
tags = [tag.name for tag in soup.find_all()]
for child in paragraph.children:
    if child.name in tags:
        print('{0}'.format(child))  # or child.text
    else:
        print(child)
Output
IN
<i>THE </i>
<b>DISTRICT</b>
COURT OF {county} COUNTY
STATE OF OKLAHOMA
Upvotes: 0
Reputation: 177725
Is this an example of what you are trying to do when you say "do something different" with tags? Having a sample of the full desired output in the question would help:
from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE</i> <b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'
soup = bs(html_doc, 'html.parser')
para = soup.p
for child in para.children:
    if child.name == 'i':
        print(f'*{child.text}*', end='')
    elif child.name == 'b':
        print(f'**{child.text}**', end='')
    else:
        print(child, end='')
Output:
IN *THE* **DISTRICT** COURT OF {county} COUNTY
STATE OF OKLAHOMA
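The same dispatch can also be written against the node types instead of comparing names one by one. This is only a sketch: the `render` helper and the `MARKERS` mapping are hypothetical names made up for illustration, and the markers are just the Markdown-style ones used above. It relies on bs4 representing elements as `Tag` and text as `NavigableString`.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html_doc = '<p class="line-spacing-double" align="center">IN <i>THE</i> <b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'

# Hypothetical mapping from tag name to the marker wrapped around its text.
MARKERS = {'i': '*', 'b': '**'}

def render(paragraph):
    """Return the paragraph text with <i>/<b> children wrapped in markers."""
    parts = []
    for child in paragraph.children:
        if isinstance(child, Tag):
            # Tags not listed in MARKERS are emitted as plain text.
            marker = MARKERS.get(child.name, '')
            parts.append(f'{marker}{child.get_text()}{marker}')
        elif isinstance(child, NavigableString):
            # Plain text between the tags is kept unchanged.
            parts.append(str(child))
    return ''.join(parts)

soup = BeautifulSoup(html_doc, 'html.parser')
result = render(soup.p)
print(result)
```

Keeping the markers in a dictionary means supporting another tag is a one-line change rather than another `elif` branch.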
Upvotes: 1
Reputation: 33384
Use findChildren() and then check the child name with if conditions.
from bs4 import BeautifulSoup as bs
html_doc = '<p class="line-spacing-double" align="center">IN <i>THE </i><b>DISTRICT</b> COURT OF {county} COUNTY\nSTATE OF OKLAHOMA</p>'
soup = bs(html_doc, 'html.parser')
for child in soup.find('p').findChildren(recursive=False):
    if child.name == 'i':
        print(child)
    if child.name == 'b':
        print(child)
Output:
<i>THE </i>
<b>DISTRICT</b>
Upvotes: 0
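For context on why the attempts in the question printed nothing for the tags: as far as I understand it, `find_all` (and its older aliases `findAll` and `findChildren`) searches a tag's descendants, not the tag itself. Calling it on the `<i>` child therefore looks for an `<i>` nested inside that `<i>`. A small sketch of the difference, using a shortened version of the paragraph; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_doc = '<p>IN <i>THE </i><b>DISTRICT</b> COURT</p>'
soup = BeautifulSoup(html_doc, 'html.parser')

italic = soup.p.i  # the <i> tag itself

# Searching inside <i> for another <i> finds nothing: it has no nested <i>.
nested = italic.find_all('i')

# Searching from the parent <p> finds it, because <i> is a descendant of <p>.
from_parent = soup.p.find_all('i')

# Text nodes have name None, so comparing names is safe for every child.
names = [child.name for child in soup.p.children]

print(nested, from_parent, names)
```

This is why the answers above compare `child.name` on each child instead of searching within it.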