Divya

Reputation: 71

BeautifulSoup python to parse html files

I am using BeautifulSoup to replace all the commas in an html file with ‚. Here is my code for that:

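import re
import sys

from bs4 import BeautifulSoup  # assumed; the post does not show its imports

# Read the HTML file named on the command line
with open(sys.argv[1], "r") as f:
    data = f.read()

soup = BeautifulSoup(data)

comma = re.compile(',')

# Replace the comma in every text node that contains one
for t in soup.findAll(text=comma):
    t.replaceWith(t.replace(',', '‚'))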

This code works, except when the html file includes some javascript. In that case it also replaces the commas within the javascript code, which I do not want. I only want to replace commas in the text content of the html file.

Upvotes: 0

Views: 4191

Answers (1)

Sean Vieira

Reputation: 160073

The text argument of soup.findAll can take a callable, which is called with each text node:

tags_to_skip = set(["script", "style"])
# Add to this set as needed

def valid_tags(text):
    """Filter text nodes on the basis of their parent tag's name

    If the parent tag's name is found in ``tags_to_skip`` then
    the text node is dropped.  Otherwise, it is kept.
    """
    # text.parent is the tag that directly encloses this text node
    return text.parent.name.lower() not in tags_to_skip

for t in soup.findAll(text=valid_tags):
    t.replaceWith(t.replace(',', '‚'))
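For anyone who wants to try this end to end, here is a minimal, self-contained sketch of filtering text nodes by their parent tag. It assumes the newer bs4 package (with its find_all(string=...) and replace_with spellings) and the built-in "html.parser"; the names SKIP_PARENTS, REPLACEMENT, replace_commas and the sample markup are made up for illustration and are not from the original post.

```python
from bs4 import BeautifulSoup

SKIP_PARENTS = {"script", "style"}  # hypothetical name; extend as needed
REPLACEMENT = "‚"  # the replacement character used in the question


def replace_commas(html):
    """Return html with commas replaced in text outside script/style tags."""
    soup = BeautifulSoup(html, "html.parser")

    def wanted(text):
        # text is a NavigableString; text.parent is the tag enclosing it
        return "," in text and text.parent.name not in SKIP_PARENTS

    # find_all returns a list, so replacing nodes while looping is safe
    for node in soup.find_all(string=wanted):
        node.replace_with(node.replace(",", REPLACEMENT))
    return str(soup)


# Made-up sample: one paragraph and one inline script
sample = "<p>a, b</p><script>f(1, 2);</script>"
result = replace_commas(sample)
print(result)
```

The comma inside the paragraph should be replaced while the script body is left alone. Checking the parent name only skips text directly inside script or style, which is enough here because those tags hold their content as a single text node.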

Upvotes: 5
