Python - Remove HTML-tag with regex

Question

This usually is no hard task, but today I can't seem to remove a simple javascript tag..

The example I'm working with (formated):


');
    });

The example I'm working with (raw)

I would like to remove everything from (beginning of second line) to (last line). This will output only the first line, .



Here's my line of code:

re.sub(r']+', '', text)
#or
re.sub(r'', '', text)


I'm clearly missing something, but I can't see what.

Note: The document I'm working with contains mainly plain text so no parsing with lxml or similar is needed.

glibdud · Accepted Answer

Your first regex didn't work because character classes ([...]) are a collection of characters, not a string. So it will only match if it finds separated from by a string of characters that doesn't include any of <, /, s, c, etc.



Your second regex is better, and the only reason it's not working is because by default, the . wildcard does not match newlines. To tell it you want it to, you'll need to add the DOTALL flag:

re.sub(r'', '', text, flags=re.DOTALL)

Python - Remove HTML-tag with regex

Answers (1)

Related Questions