lazarea
lazarea

Reputation: 1349

How to replace every single HTML section in a large file in Python?

I have several hundreds of long files with repeated blocks of HTML in each that I won't need for my further text analysis, therefore I would like to get rid of them as they occupy quite a lot of valuable memory when analyzing these files.

These HTML blocks are occasionally broken by a newline character. Just like regular HTML, the removable blocks always begin with <!DOCTYPE and end with </html>.

My approach was the following:

content = inputfile.read()
pattern = re.compile('<!DOCTYPE.*[\s\S]*<\/html>')
match = pattern.findall(content)

However, this always returns only one single match. The regex correctly identifies the very first instance of <!DOCTYPE and the very last instance of </html>. Thus, even if I have 10,000 HTML blocks across the document that I want to remove using

content = re.sub(pattern, '', content)

only one match has been found and thus, almost my whole file gets removed.

How could I find all the HTML blocks separately throughout the document?

P.S.: I use Python3.x and my OS is Windows 10.

Upvotes: 0

Views: 36

Answers (1)

Ronald
Ronald

Reputation: 3325

Regular expressions are greedy by default. That means it searches until it finds the last <\HTML> instance. Change your expression as follows:

pattern = re.compile('<!DOCTYPE.*?<\/html>', flags=re.DOTALL)

Upvotes: 1

Related Questions