user3525290

Reputation: 1617

Beautiful Soup replaces < with &lt;

I've found the element I want to replace, but when I print soup the markup gets escaped. <div id="content">stuff here</div> becomes &lt;div id="content"&gt;stuff here&lt;/div&gt;. How can I preserve the markup? I have tried print(soup.encode(formatter="none")), but that produces the same incorrect output.

from bs4 import BeautifulSoup

with open(index_file) as fp:
    soup = BeautifulSoup(fp,"html.parser")

found = soup.find("div", {"id": "content"})
found.replace_with(data)

When I print found, I get the correct format:

>>> print(found)
<div id="content">stuff</div>

index_file contents are below:

 <!DOCTYPE html>
 <head>
    Apples 
 </head>
 <body>

   <div id="page">
    This is the Id of the page

  <div id="main">

     <div id="content">
       stuff here
     </div>
  </div>
 footer should go here
 </div>
</body>
</html>

Upvotes: 3

Views: 2770

Answers (1)

Mad Physicist

Reputation: 114300

The found object is not a Python string; it's a Tag that happens to have a readable string representation. You can verify this with

type(found)

A Tag is part of the hierarchy of objects that Beautiful Soup creates for you to be able to interact with the HTML. Another such object is NavigableString. NavigableString is a lot like a string, but it can only contain things that would go into the content portion of the HTML.
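To see both object types side by side, here is a minimal sketch. It assumes a one-line document standing in for index_file; the markup string is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumption: a tiny stand-in for the contents of index_file.
soup = BeautifulSoup('<div id="content">stuff here</div>', "html.parser")
found = soup.find("div", {"id": "content"})

# The element itself is a Tag; the text inside it is a NavigableString.
print(isinstance(found, Tag))
print(isinstance(found.string, NavigableString))

# A NavigableString behaves like str in most respects.
print(isinstance(found.string, str))
```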

When you do

found.replace_with('<div id="content">stuff here</div>')

you are asking for the Tag to be replaced with a NavigableString containing that literal text. For that text to appear literally in HTML output, Beautiful Soup has to escape the angle brackets, which is exactly what you are seeing.
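The escaping is easy to reproduce in isolation. The sketch below assumes a small two-level document in place of index_file and a made-up replacement string.

```python
from bs4 import BeautifulSoup

# Assumption: a reduced version of the page from the question.
soup = BeautifulSoup(
    '<div id="main"><div id="content">stuff here</div></div>',
    "html.parser",
)
found = soup.find("div", {"id": "content"})

# Passing a plain str inserts it as text, not as markup.
found.replace_with('<div id="content">new stuff</div>')

# The angle brackets of the inserted text come out as &lt; and &gt;.
print(soup)
```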

Instead of that mess, you probably want to keep your Tag and replace only its content:

found.string.replace_with('stuff here')

Notice that the correct replacement does not attempt to overwrite the tags.
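If the replacement really does have to carry its own markup, one option is to parse it first so that a Tag, not a string, gets inserted. This is a sketch of both routes, assuming a reduced document and a made-up data string; it is not necessarily what the asker's real data looks like.

```python
from bs4 import BeautifulSoup

page = '<div id="main"><div id="content">stuff here</div></div>'

# Route 1: keep the Tag and swap only the text inside it.
soup_text = BeautifulSoup(page, "html.parser")
soup_text.find("div", {"id": "content"}).string.replace_with("new stuff")
print(soup_text)

# Route 2: parse the replacement markup and insert the resulting Tag.
data = '<div id="content">new stuff</div>'  # assumption: hypothetical data
new_tag = BeautifulSoup(data, "html.parser").div
soup_tag = BeautifulSoup(page, "html.parser")
soup_tag.find("div", {"id": "content"}).replace_with(new_tag)
print(soup_tag)
```

Both routes leave real tags in the tree, so nothing is escaped when the soup is printed.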

When you do found.replace_with(...), the object referred to by the name found gets replaced in the parent hierarchy. However, the name found keeps pointing to the same outdated object as before. That is why printing soup shows the update, but printing found does not.
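The stale reference can be checked directly. In this sketch (same assumed reduced document and made-up replacement markup), found is detached from the tree after the call, while a fresh lookup returns the new element.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="main"><div id="content">stuff here</div></div>',
    "html.parser",
)
found = soup.find("div", {"id": "content"})

# Assumption: hypothetical replacement markup, parsed into a Tag first.
new_tag = BeautifulSoup('<div id="content">new stuff</div>', "html.parser").div
found.replace_with(new_tag)

# found still holds the old element, which no longer has a parent.
print(found)
print(found.parent is None)

# Looking the element up again returns the one that is now in the tree.
current = soup.find("div", {"id": "content"})
print(current)
```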

Upvotes: 7
