Reputation: 1562
I am currently working on a project in which I want to allow regex search in/on a huge set of HTML files.
After first pinpointing the files of my interest I now want to highlight the found keyword!
Using BeautifulSoup I can determine the Node in which my keyword is found. One thing I do is changing the color of the whole parent.
However, I would also like to add my own <span>-Tags around just they keyword(s) I found.
Determining the position and such is no big deal using the find()-functions provided by BFSoup. But adding my tags around regular text seems to be impossible?
# match = keyword found by another regex
# node = the node I found using the soup.find(text=myRE)
node.parent.setString(node.replace(match, "<myspan>"+match+"</myspan>"))
This way I only add mere text and not a proper Tag, as the document is not freshly parsed, which I hope to avoid!
I hope my problem became a little clear :)
Upvotes: 3
Views: 4579
Reputation: 15345
Here's a simple example showing one way to do it:
import re
from bs4 import BeautifulSoup as Soup
html = '''
<html><body><p>This is a paragraph</p></body></html>
'''
(1) store the text and empty the tag
soup = Soup(html)
text = soup.p.string
soup.p.clear()
print soup
(2) get start and end positions of the words to be boldened (apologies for my English)
match = re.search(r'\ba\b', text)
start, end = match.start(), match.end()
(3) split the text and add the first part
soup.p.append(text[:start])
print soup
(4) create a tag, add the relevant text to it and append it to the parent
b = soup.new_tag('b')
b.append(text[start:end])
soup.p.append(b)
print soup
(5) append the rest of the text
soup.p.append(text[end:])
print soup
here is the output from above:
<html><body><p></p></body></html>
<html><body><p>This is </p></body></html>
<html><body><p>This is <b>a</b></p></body></html>
<html><body><p>This is <b>a</b> paragraph</p></body></html>
Upvotes: 4
Reputation: 7233
If you add the text...
my_tag = node.parent.setString(node.replace(match, "<myspan>"+match+"</myspan>"))
...and pass it through BeautifulSoup once more
new_soup = BeautifulSoup(my_tag)
it should be classified as a BS tag object and available for parsing.
You could apply these changes to the original mass of text and run it through as a whole, to avoid repetition.
EDIT:
From the docs:
# Here is a more complex example that replaces one tag with another:
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup("<b>Argh!<a>Foo</a></b><i>Blah!</i>")
tag = Tag(soup, "newTag", [("id", 1)])
tag.insert(0, "Hooray!")
soup.a.replaceWith(tag)
print soup
# <b>Argh!<newTag id="1">Hooray!</newTag></b><i>Blah!</i>
Upvotes: 2