Parsing invalid HTML and retrieving tag's text to replace it

Question

I need to iterate over invalid HTML and get the text of every tag so that I can change it.

from bs4 import BeautifulSoup

html_doc = """

   
   
    Sklizeň jahod 2019
   
  
Začátek sklizně: Zahájeno

Otevřeno: 6 h – do otrhání, denně

"""

soup = BeautifulSoup(html_doc, "html.parser")

for tag in soup.find_all():
    print(tag.name)
    if tag.string:
        tag.string.replace_with("1")

print(soup)

The result is




1
 
Začátek sklizně: 1

Otevřeno: 1, denně

I know how to replace the text, but BeautifulSoup won't find the text that sits directly inside the paragraph tag. The texts "Začátek sklizně:", "Otevřeno:" and ", denně" are not found, so I cannot replace them.

I tried different parsers, but lxml and html5lib make no difference. I also tried Python's built-in HTML library, but it only supports iterating over HTML, not changing it.

HedgeHog · Accepted Answer

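On a Tag object, .string returns a NavigableString only when the tag has exactly one child. If that child is a string, it is returned; if it is a single nested tag, that tag's .string is used. If the tag has no children or more than one child, .string returns None. Your paragraph tags contain both bare text and a nested tag, so tag.string is None for them and their text is never replaced.

The scenario is not quite clear to me, but here is one last approach based on your comment: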
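The difference is easy to check on a small snippet. This is only a sketch: the markup below is hypothetical (the `<p>` and `<b>` tags are assumptions, not the markup from the question), but it has the same shape, a paragraph with bare text around one nested tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a paragraph holding bare text plus one nested tag.
soup = BeautifulSoup("<p>Start: <b>open</b>, daily</p>", "html.parser")

print(soup.b.string)         # open  (exactly one string child)
print(soup.p.string)         # None  (three children: text, <b>, text)
print(list(soup.p.strings))  # ['Start: ', 'open', ', daily']
```

The plural .strings generator walks every text node under the tag, which is why it still sees the pieces that .string misses.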

I need generic code to iterate any html and find all texts so I can work with them.

for tag in soup.find_all(text=True):
    tag.replace_with('1')

Example

from bs4 import BeautifulSoup

html_doc = """
   
   
    Sklizeň jahod 2019
   
  
Začátek sklizně: Zahájeno

Otevřeno: 6 h – do otrhání, denně
"""

soup = BeautifulSoup(html_doc, 'html.parser')

for tag in soup.find_all(text=True):
    tag.replace_with('1')

Output
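As a rough end-to-end illustration, here is the same loop run on stand-in markup. The `<p>` and `<b>` tags and the English text are assumptions, not the HTML from the question. The sketch uses `string=True`, which is the name Beautiful Soup 4.4.0 and later use for the older `text` argument, and it skips whitespace-only nodes so that line breaks between tags survive.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the <p> and <b> tags are an assumption.
html_doc = """
<p>Start: <b>open</b></p>
<p>Hours: <b>6 am</b>, daily</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all(string=True) returns every text node, including the bare
# text that sits next to a nested tag inside the same parent.
for node in soup.find_all(string=True):
    if node.strip():  # leave whitespace-only nodes untouched
        node.replace_with("1")

print(soup)
# <p>1<b>1</b></p>
# <p>1<b>1</b>1</p>
```

Note that `find_all(string=True)` also yields comments and similar special strings, so a real document may need an extra type check before replacing.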
