Filip Bartoš

Reputation: 301

Parsing invalid HTML and retrieving tag's text to replace it

I need to iterate over invalid HTML and get the text of every tag so that I can change it.

from bs4 import BeautifulSoup

html_doc = """
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
   <div class="oxy-expand-collapse-icon" href="#"></div>
   <div class="oxy-toggle-content">
    <h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3>   </div>
  </div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

for tag in soup.find_all():
    print(tag.name)
    if tag.string:
        tag.string.replace_with("1")

print(soup)

The result is

<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>1</strong><br/>
Otevřeno: <strong>1</strong>, denně</p>
</span></div>

I know how to replace the text, but BeautifulSoup does not find the text that sits directly inside the paragraph tag. The strings "Začátek sklizně:", "Otevřeno:" and ", denně" are never found, so I cannot replace them.

I tried other parsers such as lxml and html5lib, but that made no difference. I also tried Python's HTML library, but it only supports iterating over HTML, not changing it.

Upvotes: 1

Views: 141

Answers (1)

HedgeHog

Reputation: 25073

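On a Tag object, .string returns a NavigableString only when the tag has exactly one child. If that child is a string, you get that string; if it is a single nested tag, you get that tag's .string. If the tag has no children or more than one child, .string returns None. That is why the text directly inside <p>, which also has <strong> and <br> children, is never reached.

The scenario is not quite clear to me, but here is one last approach based on your comment: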
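To make that rule concrete, here is a minimal sketch on a shortened version of the paragraph from the question. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Otevřeno: <strong>6 h</strong>, denně</p>", "html.parser")

# <strong> has exactly one child, a string, so .string returns it.
strong_text = soup.strong.string

# <p> has three children (string, tag, string), so .string is None.
p_text = soup.p.string

# .strings still yields every text node below <p>, in document order.
p_strings = list(soup.p.strings)

print(repr(strong_text), repr(p_text), p_strings)
```

So the loop in the question skips <p> because tag.string is None there, not because the parser lost the text.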

I need generic code to iterate any html and find all texts so I can work with them.

for tag in soup.find_all(text=True):
    tag.replace_with('1')

Example

from bs4 import BeautifulSoup

html_doc = """<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
   <div class="oxy-expand-collapse-icon" href="#"></div>
   <div class="oxy-toggle-content">
    <h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3>   </div>
  </div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find_all(text=True) returns every text node, including whitespace-only ones
for tag in soup.find_all(text=True):
    tag.replace_with('1')

print(soup)

Output

<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">1<div class="oxy-expand-collapse-icon" href="#"></div>1<div class="oxy-toggle-content">1<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3>1</div>1</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>1<strong>1</strong><br/>1<strong>1</strong>1</p>1</span></div>

Upvotes: 1
