Replace arbitrary HTML (subtree) within HTML document with other HTML (subtree) with BS4 or regex

Question

I am trying to build a function along the following lines:

import bs4

def replace(html: str, selector: str, old: str, new: str) -> str:

    soup = bs4.BeautifulSoup(html, "html.parser")  # likely a complete HTML document
    old_soup = bs4.BeautifulSoup(old, "html.parser")  # can contain HTML tags etc.
    new_soup = bs4.BeautifulSoup(new, "html.parser")  # can contain HTML tags etc.

    for selected in soup.select(selector):

        ### pseudo-code start
        # find_everything is not a real bs4 method; it stands for the missing piece
        for match in selected.find_everything(old_soup):
            match.replace_with(new_soup)
        ### pseudo-code end

    return str(soup)

I want to replace an arbitrary HTML subtree, anywhere below the elements matched by a CSS selector in a full HTML document, with another arbitrary HTML subtree. The selector, old and new arguments are read as strings from a configuration file.

My document could look as follows:

before = r"""<!DOCTYPE html>
<html>
<head>
    <title>No target here</title>
</head>
<body>
    <h1>This is the target!</h1>
    <p class="target">
        Yet another <b>target</b>.
    </p>
    <p>
        <!-- Comment -->
        Foo target Bar
    </p>
</body>
</html>
"""

This is supposed to work:

after = replace(
    html = before,
    selector = 'body', # from config text file
    old = 'target', # from config text file
    new = '<span class="special">target</span>', # from config text file
)

assert after == r"""<!DOCTYPE html>
<html>
<head>
    <title>No target here</title>
</head>
<body>
    <h1>This is the <span class="special">target</span>!</h1>
    <p class="target">
        Yet another <b><span class="special">target</span></b>.
    </p>
    <p>
        <!-- Comment -->
        Foo <span class="special">target</span> Bar
    </p>
</body>
</html>
"""

A plain str.replace does not work because "target" can appear literally anywhere in the document, for instance in a class name. I briefly considered doing this with a regular expression. I did not succeed, but I would be happy to see a working version. Currently, I think my best chance is to use BeautifulSoup.

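To illustrate the problem, here is a minimal sketch on a shortened, hypothetical fragment (not the full document above). The naive replacement also rewrites the class attribute and corrupts the markup:

```python
# Hypothetical one-line fragment, shortened from the example document.
fragment = '<p class="target">Yet another <b>target</b>.</p>'

# Naive approach: replace every literal occurrence of the word.
naive = fragment.replace("target", '<span class="special">target</span>')

# The attribute value is replaced as well, so the <p> tag is no longer valid.
print(naive)
```
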
I understand how to swap a specific tag, and I can also replace specific text. However, I am failing to replace an "arbitrary HTML subtree", that is, to replace some HTML with some other HTML in a sane manner. In this context, I want to treat old and new really as HTML: if old is simply a word that also appears, for instance, in a class name, I only want to replace it where it is content in the document, not where it is a class name, as shown above.

Any ideas how to do this?
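
A possible starting point, offered as a sketch rather than a complete solution: it only covers the case where `old` is plain text (as in the example above), not an arbitrary subtree. The function name `replace_text` and the choice of the `"html.parser"` backend are assumptions made for illustration. The idea is to walk only the ordinary text nodes below the selected elements, so attribute values and comments are never touched.

```python
import bs4
from bs4.element import NavigableString


def replace_text(html: str, selector: str, old: str, new: str) -> str:
    """Replace plain-text occurrences of `old` below `selector` with the markup `new`.

    Sketch only: `old` is treated as plain text, not as an HTML subtree.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")

    # Collect all matching text nodes before modifying the tree, so that
    # freshly inserted markup is never processed a second time.
    targets, seen = [], set()
    for selected in soup.select(selector):
        for node in selected.find_all(string=True):
            # Only plain text nodes; comments, doctypes etc. are subclasses
            # of NavigableString and are skipped by the exact type check.
            if type(node) is NavigableString and old in node and id(node) not in seen:
                seen.add(id(node))
                targets.append(node)

    for node in targets:
        for i, part in enumerate(str(node).split(old)):
            if i:
                # Re-parse `new` for every occurrence to get fresh nodes.
                for child in list(bs4.BeautifulSoup(new, "html.parser").contents):
                    node.insert_before(child)
            if part:
                node.insert_before(NavigableString(part))
        node.extract()  # drop the original text node

    return str(soup)


# Hypothetical shortened fragment for a quick manual check.
doc = '<body><p class="target">Yet another <b>target</b>.</p></body>'
print(replace_text(doc, "body", "target", '<span class="special">target</span>'))
```

Note that the serialized result can differ from the input in small details (for example whitespace around the doctype), so comparing complete documents by exact string equality may be brittle. Checking for the expected fragments is likely more robust.
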
