Reputation: 1349
I have several hundred long files, each containing repeated blocks of HTML that I won't need for my further text analysis. I would like to get rid of these blocks, as they occupy quite a lot of valuable memory when analyzing the files.
These HTML blocks are occasionally broken by a newline character. Just like regular HTML, the removable blocks always begin with <!DOCTYPE and end with </html>.
My approach was the following:
content = inputfile.read()
pattern = re.compile('<!DOCTYPE.*[\s\S]*<\/html>')
match = pattern.findall(content)
However, this always returns only a single match: the regex matches from the very first instance of <!DOCTYPE to the very last instance of </html> in the file. Thus, even if there are 10,000 HTML blocks across the document that I want to remove using
content = re.sub(pattern, '', content)
only one match is found, and almost my whole file gets removed.
How could I find all the HTML blocks separately throughout the document?
P.S.: I use Python3.x and my OS is Windows 10.
Upvotes: 0
Views: 36
Reputation: 3325
Regular expression quantifiers are greedy by default: .* matches as much as it can, so it runs all the way to the last </html> instance in the string. Make the quantifier non-greedy (.*?) so each match ends at the nearest </html>:
pattern = re.compile(r'<!DOCTYPE.*?</html>', flags=re.DOTALL)
The re.DOTALL flag makes . match newlines as well, so blocks that are broken by a newline character are still matched; the [\s\S] workaround is then unnecessary.
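A minimal sketch of the difference, using a small hypothetical two-block document (the sample strings are made up for illustration):

```python
import re

# Hypothetical input: two HTML blocks separated by text we want to keep;
# the first block is broken by a newline, as described in the question.
content = (
    "keep this\n"
    "<!DOCTYPE html><html><body>junk\n</body></html>\n"
    "keep that\n"
    "<!DOCTYPE html><html>more junk</html>\n"
    "keep too"
)

# Non-greedy quantifier: each match stops at the nearest </html>.
# re.DOTALL lets '.' also match newline characters.
pattern = re.compile(r'<!DOCTYPE.*?</html>', flags=re.DOTALL)

print(len(pattern.findall(content)))  # 2 separate blocks found, not 1
cleaned = pattern.sub('', content)
print(cleaned)  # only the "keep ..." lines remain
```

With the greedy version (.* instead of .*?), findall would return a single match spanning from the first <!DOCTYPE to the last </html>, wiping out the "keep that" line in between.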
Upvotes: 1