Reputation: 1349
I have several hundred long files, each containing repeated blocks of HTML that I won't need for my further text analysis. I would like to get rid of these blocks, as they occupy quite a lot of valuable memory when analyzing the files.
These HTML blocks are occasionally broken by a newline character. Just like regular HTML, the removable blocks always begin with <!DOCTYPE and end with </html>.
My approach was the following:
content = inputfile.read()
pattern = re.compile('<!DOCTYPE.*[\s\S]*<\/html>')
match = pattern.findall(content)
However, this always returns only a single match: the regex matches from the very first instance of <!DOCTYPE to the very last instance of </html> in the file. Thus, even if there are 10,000 HTML blocks across the document that I want to remove using
content = re.sub(pattern, '', content)
only one match is found, and almost my whole file gets removed.
How could I find all the HTML blocks separately throughout the document?
P.S.: I use Python3.x and my OS is Windows 10.
Upvotes: 0
Views: 36
Reputation: 3325
Regular expression quantifiers are greedy by default: .* matches as much as it can, so it runs all the way to the last </html> instance in the string. Make the quantifier non-greedy (.*?) so each match ends at the nearest </html>:
pattern = re.compile(r'<!DOCTYPE.*?</html>', flags=re.DOTALL)
The re.DOTALL flag makes . match newlines as well, so blocks that are broken by a newline character are still matched; the [\s\S] workaround is then unnecessary.
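A minimal sketch of the difference, using a small hypothetical two-block document (the sample strings are made up for illustration):

```python
import re

# Hypothetical input: two HTML blocks separated by text we want to keep;
# the first block is broken by a newline, as described in the question.
content = (
    "keep this\n"
    "<!DOCTYPE html><html><body>junk\n</body></html>\n"
    "keep that\n"
    "<!DOCTYPE html><html>more junk</html>\n"
    "keep too"
)

# Non-greedy quantifier: each match stops at the nearest </html>.
# re.DOTALL lets '.' also match newline characters.
pattern = re.compile(r'<!DOCTYPE.*?</html>', flags=re.DOTALL)

print(len(pattern.findall(content)))  # 2 separate blocks found, not 1
cleaned = pattern.sub('', content)
print(cleaned)  # only the "keep ..." lines remain
```

With the greedy version (.* instead of .*?), findall would return a single match spanning from the first <!DOCTYPE to the last </html>, wiping out the "keep that" line in between.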
Upvotes: 1