Reputation: 1772
I have basically an RSS indexing app written in Python that stores the RSS content as a blurb in the DB. When the app initially processed the article contents, it commented out all links that didn't match certain criteria, for example:
<a href="http://google.com">Google</a>
Became:
<!--<a href="http://google.com">Google</a>--> Google
Now I need to process all these old articles and modify the links. So using BeautifulSoup 4 I can easily find the comments using:
links = soup.findAll(text=lambda text:isinstance(text, Comment))
for link in links:
text = re.sub('<[^>]*>', '', link.string)
# any html in the link tag was escaped by BS4, so need to convert back
text = text.replace('&lt;','<')
text = text.replace('&gt;','>')
find = link.string + " " + text
The ouput of "find" above is:
<!--<a href="http://google.com">Google</a>--> Google
Which makes it easier to perform a .replace()
on the content.
Now the problem I'm having (and I'm sure this is simple) is multi-line find/replacing. When Beautiful Soup initial commented out the links, some were converted to:
<!--<a href="http://google.com">Google
</a>--> Google
or
<!--<a href="http://google.com">Google</a>-->
Google
So obviously, replace(old,new)
won't work since replace()
doesn't cover multi-lines.
Can someone help me out with a regex multi-line find/replace? It should be case-sensitive.
Upvotes: 1
Views: 986