Reputation: 3729
I am trying to build a function along the following lines:
import bs4

def replace(html: str, selector: str, old: str, new: str) -> str:
    soup = bs4.BeautifulSoup(html) # likely complete HTML document
    old_soup = bs4.BeautifulSoup(old) # can contain HTML tags etc
    new_soup = bs4.BeautifulSoup(new) # can contain HTML tags etc
    for selected in soup.select(selector):
        ### pseudo-code start
        for match in selected.find_everything(old_soup):
            match.replace_with(new_soup)
        ### pseudo-code end
    return str(soup)
I want to be able to replace an arbitrary HTML subtree below a CSS selector within a full HTML document with another arbitrary HTML subtree. selector, old and new are read as strings from a configuration file.
My document could look as follows:
before = r"""<!DOCTYPE html>
<html>
<head>
<title>No target here</head>
</head>
<body>
<h1>This is the target!</h1>
<p class="target">
Yet another <b>target</b>.
</p>
<p>
<!-- Comment -->
Foo target Bar
</p>
</body>
</html>
"""
This is supposed to work:
after = replace(
    html = before,
    selector = 'body', # from config text file
    old = 'target', # from config text file
    new = '<span class="special">target</span>', # from config text file
)
assert after == r"""<!DOCTYPE html>
<html>
<head>
<title>No target here</head>
</head>
<body>
<h1>This is the <span class="special">target</span>!</h1>
<p class="target">
Yet another <b><span class="special">target</span></b>.
</p>
<p>
<!-- Comment -->
Foo <span class="special">target</span> Bar
</p>
</body>
</html>
"""
A plain str.replace does not work because "target" can appear literally anywhere ... I briefly considered doing this with a regular expression. I have to admit that I did not succeed, but I'd be happy to see it working. Currently, I think my best chance is to use beautifulsoup.
I understand how to swap a specific tag. I can also replace specific text etc. However, I am failing to replace an "arbitrary HTML subtree", as in I want to replace some HTML with some other HTML in a sane manner. In this context, I want to treat old and new as HTML, so if old is simply a "word" that also appears, for instance, in a class name, I only want to replace it where it is content in the document, not where it is a class name as shown above.
Any ideas how to do this?
Upvotes: 1
Views: 124
Reputation: 71471
The solution below works in three parts:
1. All matches of selector from html are discovered.
2. Then, each match (as a soup object) is recursively traversed and every child is matched against old.
3. If the child object is equivalent to old, then it is extracted and new is inserted into the original match at the same index as the child object.
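A note on what "equivalent" means here: as far as I know, bs4 compares two tags by name, attributes and contents (recursively), not by identity, so a freshly parsed old can match a tag that lives in the document. A minimal sketch of that behaviour (the variable names are only for illustration):

```python
from bs4 import BeautifulSoup

# Two separately parsed fragments with the same structure.
first = BeautifulSoup("<b>target</b>", "html.parser").contents[0]
second = BeautifulSoup("<b>target</b>", "html.parser").contents[0]
# Same tag name and text, but a different attribute set.
other = BeautifulSoup('<b class="x">target</b>', "html.parser").contents[0]

same_structure = first == second  # structural comparison
same_object = first is second     # identity comparison
differs = first == other          # attributes are part of the comparison
```

Here same_structure should be True while same_object and differs are False, which is why the answer can use a plain == against the parsed old.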
import bs4
from bs4 import BeautifulSoup as soup

def replace(html:str, selector:str, old:str, new:str) -> str:
    def update_html(d:soup, old:soup) -> None:
        i = 0
        while (c:=getattr(d, 'contents', [])[i:]):
            if isinstance((a:=c[0]), bs4.element.NavigableString) and str(old) in str(a):
                a.extract()
                for j, k in enumerate((l:=str(a).split(str(old)))):
                    i += 1
                    d.insert(i, soup(k, 'html.parser'))
                    if j + 1 != len(l):
                        i += 1
                        d.insert(i, soup(new, 'html.parser'))
            elif a == old:
                a.extract()
                d.insert(i, soup(new, 'html.parser'))
                i += 1
            else:
                update_html(a, old)
                i += 1
    source, o = [soup(j, 'html.parser') for j in [html, old]]
    for i in source.select(selector):
        update_html(i, o.contents[0])
    return str(source)
after = replace(
    html = before,
    selector = 'body', # from config text file
    old = 'target', # from config text file
    new = '<span class="special">target</span>', # from config text file
)
print(after)
Output:
<!DOCTYPE html>
<html>
<head>
<title>No target here</title></head>
<body>
<h1>This is the <span class="special">target</span>!</h1>
<p class="target">
Yet another <b><span class="special">target</span></b>.
</p>
<p>
<!-- Comment -->
Foo <span class="special">target</span> Bar
</p>
</body>
</html>
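If old is always plain text (as in the example) rather than a fragment with tags, a simpler variant may be enough. The sketch below is only an assumption about that special case, and replace_text is a name I made up for it. It collects the text nodes below the selector first, skips comments and other special strings, and splices a freshly parsed copy of new between the split pieces. It does not handle an old that contains tags, and it will not find a match that spans several text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

def replace_text(html: str, selector: str, old: str, new: str) -> str:
    """Replace plain-text occurrences of `old` below `selector` with the HTML in `new`."""
    if not old:
        return html  # splitting on an empty string is not meaningful
    soup = BeautifulSoup(html, "html.parser")

    # Collect the text nodes before touching the tree, and deduplicate them,
    # so nested selector matches and freshly inserted text are not revisited.
    seen, nodes = set(), []
    for selected in soup.select(selector):
        for node in selected.find_all(string=True):
            # Only exact NavigableString instances are ordinary text;
            # comments, doctypes and similar nodes are subclasses.
            if type(node) is NavigableString and old in node and id(node) not in seen:
                seen.add(id(node))
                nodes.append(node)

    for node in nodes:
        parts = str(node).split(old)
        for index, part in enumerate(parts):
            if part:
                node.insert_before(NavigableString(part))
            if index < len(parts) - 1:
                # Parse `new` again for every occurrence so each one gets its own nodes.
                for child in list(BeautifulSoup(new, "html.parser").contents):
                    node.insert_before(child)
        node.extract()  # drop the original text node
    return str(soup)
```

Called with selector = 'body', old = 'target' and the same new as above, this should leave class="target" and the title alone, since attributes are not text nodes and the title is outside the selector.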
Upvotes: 1