Reputation: 1617
I've found the text I want to replace, but when I print soup the format gets changed: <div id="content">stuff here</div>
becomes &lt;div id="content"&gt;stuff here&lt;/div&gt;
How can I preserve the data? I have tried print(soup.encode(formatter="none")), but that produces the same incorrect format.
from bs4 import BeautifulSoup
with open(index_file) as fp:
    soup = BeautifulSoup(fp, "html.parser")
found = soup.find("div", {"id": "content"})
found.replace_with(data)
When I print found, I get the correct format:
>>> print(found)
<div id="content">stuff</div>
The contents of index_file are below:
<!DOCTYPE html>
<head>
Apples
</head>
<body>
<div id="page">
This is the Id of the page
<div id="main">
<div id="content">
stuff here
</div>
</div>
footer should go here
</div>
</body>
</html>
Upvotes: 3
Views: 2770
Reputation: 114300
The found object is not a Python string. It is a Tag that just happens to have a nice string representation. You can verify this by doing:
type(found)
A Tag is part of the hierarchy of objects that Beautiful Soup creates so that you can interact with the HTML. Another such object is NavigableString. A NavigableString is a lot like a string, but it can only contain things that would go into the text content of the HTML.
When you do
found.replace_with('<div id="content">stuff here</div>')
you are asking the Tag to be replaced with a NavigableString containing that literal text. The only way for the output HTML to represent that text literally is to escape the angle brackets (as &lt; and &gt;), which is exactly what you are seeing.
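To see the escaping happen in isolation, here is a minimal sketch. The one-line html string is a stand-in I made up for the real index_file, and main is just a helper name for the parent div:

```python
from bs4 import BeautifulSoup, NavigableString

# Stand-in markup (assumed for illustration; not the asker's full file).
html = '<div id="main"><div id="content">old</div></div>'
soup = BeautifulSoup(html, "html.parser")
found = soup.find("div", {"id": "content"})

# Passing a plain str: Beautiful Soup inserts it as a text node, not as markup.
found.replace_with('<div id="content">stuff here</div>')

main = soup.find("div", {"id": "main"})
print(isinstance(main.contents[0], NavigableString))  # True
print(soup)  # the inserted text is serialized with &lt; and &gt;
```

The only child of main is now a text node, so the serializer has to escape it to keep the document valid.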
Instead of that mess, you probably want to keep your Tag and replace only its content:
found.string.replace_with('stuff here')
Notice that the correct replacement does not attempt to overwrite the tags.
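A small sketch of both routes, in case it helps. The html and new_markup strings are made-up examples. Option 1 is the line above. Option 2 is an alternative I would try if the replacement really is markup: parse it first so that a Tag, not a string, gets inserted. Note that found.string is only usable when the tag has a single text child, which is assumed here.

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumed for illustration).
html = '<div id="main"><div id="content">old text</div></div>'
soup = BeautifulSoup(html, "html.parser")
found = soup.find("div", {"id": "content"})

# Option 1: keep the Tag and swap only its text node.
found.string.replace_with("stuff here")
after_text_swap = str(soup)
print(after_text_swap)

# Option 2: parse the replacement markup so a Tag is inserted instead of text.
new_markup = '<div id="content">other <b>stuff</b></div>'  # hypothetical data
new_tag = BeautifulSoup(new_markup, "html.parser").div
soup.find("div", {"id": "content"}).replace_with(new_tag)
after_tag_swap = str(soup)
print(after_tag_swap)
```

Neither print should show any escaped angle brackets, because in both cases the tree contains real Tag objects for the div.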
When you do found.replace_with(...), the object referred to by the name found gets replaced in the parent hierarchy. However, the name found keeps pointing to the same outdated object as before. That is why printing soup shows the update, but printing found does not.
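You can check the stale reference yourself with something like the following (again using a short made-up document rather than the real file):

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumed for illustration).
html = '<div id="main"><div id="content">old</div></div>'
soup = BeautifulSoup(html, "html.parser")
found = soup.find("div", {"id": "content"})

found.replace_with("new text")

print(found)         # <div id="content">old</div>  (the detached, outdated tag)
print(found.parent)  # None, because it is no longer part of the tree
print(soup)          # <div id="main">new text</div>
```

If you need to keep working with the new content, look it up again from soup instead of reusing found.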
Upvotes: 7