s-m-e

Reputation: 3729

Replace arbitrary HTML (subtree) within HTML document with other HTML (subtree) with BS4 or regex

I am trying to build a function along the following lines:

import bs4

def replace(html: str, selector: str, old: str, new: str) -> str:
    
    soup = bs4.BeautifulSoup(html) # likely complete HTML document
    old_soup = bs4.BeautifulSoup(old) # can contain HTML tags etc
    new_soup = bs4.BeautifulSoup(new) # can contain HTML tags etc
    
    for selected in soup.select(selector):
        
        ### pseudo-code start
        for match in selected.find_everything(old_soup):
            match.replace_with(new_soup)
        ### pseudo-code end
    
    return str(soup)

I want to be able to replace an arbitrary HTML subtree below a CSS selector within a full HTML document with another arbitrary HTML subtree. selector, old and new are read as strings from a configuration file.

My document could look as follows:

before = r"""<!DOCTYPE html>
<html>
<head>
    <title>No target here</head>
</head>
<body>
    <h1>This is the target!</h1>
    <p class="target">
        Yet another <b>target</b>.
    </p>
    <p>
        <!-- Comment -->
        Foo target Bar
    </p>
</body>
</html>
"""

This is supposed to work:

after = replace(
    html = before,
    selector = 'body', # from config text file
    old = 'target', # from config text file
    new = '<span class="special">target</span>', # from config text file
)

assert after == r"""<!DOCTYPE html>
<html>
<head>
    <title>No target here</head>
</head>
<body>
    <h1>This is the <span class="special">target</span>!</h1>
    <p class="target">
        Yet another <b><span class="special">target</span></b>.
    </p>
    <p>
        <!-- Comment -->
        Foo <span class="special">target</span> Bar
    </p>
</body>
</html>
"""

A plain str.replace does not work because "target" can appear literally anywhere, including in attribute values. I briefly considered doing this with a regular expression. I have to admit that I did not succeed, but I'd be happy to see a working version. Currently, I think my best chance is to use Beautiful Soup.
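To illustrate the problem, here is a minimal sketch on a shortened fragment of the document above (the variable names `fragment` and `naive` are only for illustration). The class attribute gets rewritten along with the content:

```python
# A shortened fragment of the example document.
fragment = '<p class="target">Yet another <b>target</b>.</p>'

# Naive textual replacement: every occurrence is rewritten,
# including the one inside the class attribute.
naive = fragment.replace("target", '<span class="special">target</span>')

print(naive)
```

The attribute value now contains markup, so the result is no longer the HTML I want.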

I understand how to swap a specific tag. I can also replace specific text etc. However, I am failing to replace an "arbitrary HTML subtree": I want to replace some HTML with some other HTML in a sane manner. In this context, I want to treat old and new as HTML, so if old is simply a "word" that also appears, for instance, in a class name, I only want to replace it where it is content in the document, not where it is a class name, as shown above.

Any ideas how to do this?

Upvotes: 1

Views: 124

Answers (1)

Ajax1234

Reputation: 71471

The solution below works in three parts:

  1. All matches of selector in html are found.

  2. Then, each match (as a soup object) is recursively traversed and every child is matched against old.

  3. If the child object is equivalent to old, then it is extracted and new is inserted into the original match at the same index as the child object.
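Step 3 relies on comparing a child with old using `==`. As a small sketch of what "equivalent" means here, assuming Beautiful Soup's tag equality (same name, same attributes, same contents); the three sample tags are made up for illustration:

```python
from bs4 import BeautifulSoup

# Three separately parsed tags; names and attribute values are illustrative only.
first = BeautifulSoup('<b class="x">hi</b>', 'html.parser').contents[0]
second = BeautifulSoup('<b class="x">hi</b>', 'html.parser').contents[0]
other = BeautifulSoup('<b class="y">hi</b>', 'html.parser').contents[0]

print(first == second)  # equal markup, although they are distinct objects
print(first is second)
print(first == other)   # a different attribute value makes the tags unequal
```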


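import bs4
from bs4 import BeautifulSoup as soup

def replace(html:str, selector:str, old:str, new:str) -> str:
    def insert_new(d:bs4.element.Tag, i:int) -> int:
        # insert a freshly parsed copy of new at index i; return the index just past it
        for n in list(soup(new, 'html.parser').contents):
            d.insert(i, n)
            i += 1
        return i
    def update_html(d:bs4.element.PageElement, old:bs4.element.PageElement) -> None:
        i = 0
        # strings have no contents, so the loop body is skipped for them
        while (c:=getattr(d, 'contents', [])[i:]):
            if isinstance((a:=c[0]), bs4.element.NavigableString) and str(old) in str(a):
                a.extract() # the following sibling now sits at index i
                for j, k in enumerate((l:=str(a).split(str(old)))):
                    if k:
                        d.insert(i, bs4.element.NavigableString(k))
                        i += 1
                    if j + 1 != len(l):
                        i = insert_new(d, i)
            elif a == old:
                a.extract()
                i = insert_new(d, i) # inserted nodes are skipped, not scanned again
            else:
                update_html(a, old)
                i += 1
    source, o = [soup(j, 'html.parser') for j in [html, old]]
    for i in source.select(selector):
        update_html(i, o.contents[0])
    return str(source)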

after = replace(
    html = before,
    selector = 'body', # from config text file
    old = 'target', # from config text file
    new = '<span class="special">target</span>', # from config text file
)
print(after)

Output:

<!DOCTYPE html>

<html>
<head>
<title>No target here</title></head>

<body>
<h1>This is the <span class="special">target</span>!</h1>
<p class="target">
        Yet another <b><span class="special">target</span></b>.
    </p>
<p>
<!-- Comment -->
        Foo <span class="special">target</span> Bar
    </p>
</body>
</html>
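
If old is always plain text, a shorter variant is possible. The following is only a sketch under that assumption, not a replacement for the subtree matching above: `replace_text` is a hypothetical name, old is assumed to be a non-empty string, and comments and other special string types are deliberately left untouched.

```python
from bs4 import BeautifulSoup, NavigableString


def replace_text(html: str, selector: str, old: str, new: str) -> str:
    """Replace plain-text occurrences of `old` below `selector` with the HTML in `new`."""
    doc = BeautifulSoup(html, "html.parser")

    # Collect the text nodes first (deduplicated), because the tree is modified below.
    nodes = {}
    for selected in doc.select(selector):
        for node in selected.find_all(string=True):
            nodes[id(node)] = node

    for node in nodes.values():
        # Only touch ordinary text; skip comments and other string subclasses.
        if type(node) is not NavigableString or old not in node:
            continue
        parts = str(node).split(old)
        for j, part in enumerate(parts):
            if part:
                node.insert_before(NavigableString(part))
            if j + 1 < len(parts):
                # Parse `new` again for every occurrence so each gets its own nodes.
                for piece in list(BeautifulSoup(new, "html.parser").contents):
                    node.insert_before(piece)
        node.extract()  # drop the original text node

    return str(doc)


sample = '<div><p class="target">target <b>x</b> and target</p><!-- target --></div>'
result = replace_text(sample, "div", "target", "<i>target</i>")
print(result)
```

Attribute values are never visited because only text nodes are collected, so the class name stays as it is.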

Upvotes: 1
