Frayt

Reputation: 1233

BeautifulSoup - Modify contents of Tag

Given a soup object created with bs4.BeautifulSoup("<tr><td>Hello!</td><td>World!</td></tr>"), how do I remove the exclamation marks from the contents of every tr tag?

The closest I have got is:

for tr in soup.find_all("tr"):
    tr.string = tr.decode_contents().replace("!", "")

But this results in:

<html><body><tr>&lt;td&gt;Hello&lt;/td&gt;&lt;td&gt;World&lt;/td&gt;</tr></body></html>

The angle brackets in the string returned by decode_contents() are escaped when that string is assigned to tr.string.

I have also tried tr.replace_with(str(tr).replace("!", "")) (using the HTML representation of Tag objects) which gives the same result.

Bear in mind this is a simplified example. While I could iterate over the td tags instead in this specific example, in reality those tags would also contain HTML structures, presenting the same problem.

Upvotes: 1

Views: 2062

Answers (2)

Frayt

Reputation: 1233

I ended up doing the following:

import bs4

soup = bs4.BeautifulSoup("<tr><td>Hello!</td><td>World!</td></tr>", "html.parser")

for tr in soup.find_all("tr"):
    replaced_tr = str(tr).replace("!", "")
    modified_tr = bs4.BeautifulSoup(replaced_tr, "html.parser").tr
    tr.replace_with(modified_tr)

It seems replace_with does not parse strings of HTML: a plain string is inserted as text, so its angle brackets are escaped. Create a BeautifulSoup object from the string first and pass the resulting tag to replace_with.
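
To make the difference concrete, here is a small side-by-side sketch of both calls on the question's markup. The variable names (soup_text, soup_tag, escaped, parsed) are only illustrative, and it assumes the html.parser backend used above.

```python
import bs4

HTML = "<tr><td>Hello!</td><td>World!</td></tr>"

# Case 1: pass a plain str to replace_with. It is inserted as a text node,
# so the markup characters are escaped when the tree is serialized.
soup_text = bs4.BeautifulSoup(HTML, "html.parser")
tr = soup_text.find("tr")
tr.replace_with(str(tr).replace("!", ""))
escaped = str(soup_text)

# Case 2: parse the modified markup first, then pass the resulting Tag.
soup_tag = bs4.BeautifulSoup(HTML, "html.parser")
tr = soup_tag.find("tr")
new_tr = bs4.BeautifulSoup(str(tr).replace("!", ""), "html.parser").tr
tr.replace_with(new_tr)
parsed = str(soup_tag)

print(escaped)
print(parsed)
```

One caveat with this approach: the replacement runs over the serialized markup, so it also touches attribute values and delimiters such as <!-- in comments. It is probably only safe when the character cannot appear outside the text.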

Upvotes: 0

ChrisD

Reputation: 3518

You could try iterating through all the string objects that are children of <tr>.

import bs4

soup = bs4.BeautifulSoup("<table><tr><td>Hello!</td><td>World!</td></tr></table>", "html.parser")

for tr in soup.find_all("tr"):
    strings = list(tr.strings)
    for s in strings:
        new_str = s.replace("!", "")
        s.replace_with(new_str)

One catch is that you can't replace the strings returned by .strings without breaking the iterator, which is why I made it a list first. If you would rather not build the list, you can iterate in a way that fetches the next element before the current one is replaced, like so:

def iter_strings(elem):
    # Yield each string only after the next one has been fetched,
    # so the caller can safely replace the yielded string.
    strings = elem.strings
    n = next(strings, None)
    while n is not None:
        current = n
        n = next(strings, None)
        yield current

def replace_strings(element, substring, newstring):
    # replace all found `substring`'s with newstring
    for string in iter_strings(element):
        new_str = string.replace(substring, newstring)
        string.replace_with(new_str)

for tr in soup.find_all("tr"):
    replace_strings(tr, "!", "")
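
Another option that might be simpler is find_all(string=True), which returns the matching text nodes as a list, so replacing them during the loop does not disturb any live iterator. This is only a sketch: strip_substring is a made-up helper name, the extra <b title="Hi!"> element is added here to show nested markup and attributes being left alone, and it assumes a Beautiful Soup version that supports the string argument (older releases called it text).

```python
import bs4


def strip_substring(root, substring, replacement=""):
    """Hypothetical helper: replace `substring` in every text node under `root`."""
    # find_all(string=True) returns a list of text nodes, so replacing
    # nodes inside the loop does not invalidate what we are iterating over.
    for text_node in root.find_all(string=True):
        if substring in text_node:
            text_node.replace_with(text_node.replace(substring, replacement))


soup = bs4.BeautifulSoup(
    '<table><tr><td>Hello! <b title="Hi!">big!</b></td><td>World!</td></tr></table>',
    "html.parser",
)

for tr in soup.find_all("tr"):
    strip_substring(tr, "!")

result = str(soup)
print(result)
```

Because only text nodes are touched, the nested <b> tag survives and the "!" inside the title attribute is not modified.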

Upvotes: 2
