theusual

Reputation: 69

Python - Remove HTML-tag with regex

This usually isn't a hard task, but today I can't seem to remove a simple JavaScript tag.

The example I'm working with (formatted):

<section class="realestate oca"></section>
<script type="text/javascript" data-type="ad">
    window.addEventListener('DOMContentLoaded', function(){
        window.postscribe && postscribe(document.querySelector(".realestate"),
        '<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\/script>');
    });
</script>

The example I'm working with (raw):

<section class="realestate oca"></section>\n<script type="text/javascript" data-type="ad">\n\twindow.addEventListener(\'DOMContentLoaded\', function(){\n\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n\t});\n</script>

I would like to remove everything from <script (beginning of the second line) to </script> (last line), so that the output contains only the first line, <section..>.

Here's my line of code:

re.sub(r'<script[^</script>]+</script>', '', text)
#or
re.sub(r'<script.+?</script>', '', text)

I'm clearly missing something, but I can't see what.
Note: The document I'm working with contains mainly plain text so no parsing with lxml or similar is needed.

Upvotes: 0

Views: 2647

Answers (1)

glibdud

Reputation: 7840

Your first regex doesn't work because a character class ([...]) is a set of individual characters, not a string. [^</script>] matches any single character other than <, /, s, c, r, i, p, t, or >, so the pattern only matches when <script and </script> are separated by a run of characters that contains none of those.
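A small sketch of that behaviour, using a shortened, made-up script tag (the `text` value below is an assumption for illustration, not the asker's full input):

```python
import re

# [^</script>] means "any one character except <, /, s, c, r, i, p, t, >".
# Order and repetition inside the brackets are irrelevant.
negated_class = re.compile(r'[^</script>]')

# Every character of the literal string '</script>' is rejected by the class.
excluded = [ch for ch in '</script>' if negated_class.match(ch) is None]
print(excluded)

# Hypothetical shortened input: the class accepts the space after <script,
# then stops at the "t" of type=, so </script> can never follow.
text = '<script type="text/javascript">x</script>'
first_pattern = r'<script[^</script>]+</script>'
print(re.search(first_pattern, text))
```

The search returns None here because any tag body containing one of the excluded letters ends the run before the closing tag is reached.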

Your second regex is better. The only reason it fails is that, by default, the . wildcard does not match newlines. To make it match them, add the DOTALL flag:

re.sub(r'<script.+?</script>', '', text, flags=re.DOTALL)
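As a quick check, here is a sketch that runs the second pattern with and without the flag against the raw string from the question. The variable names are only for illustration:

```python
import re

# The raw input from the question, rebuilt as a Python string.
text = (
    '<section class="realestate oca"></section>\n'
    '<script type="text/javascript" data-type="ad">\n'
    "\twindow.addEventListener('DOMContentLoaded', function(){\n"
    '\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n'
    '\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/'
    'oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n'
    '\t});\n'
    '</script>'
)

pattern = r'<script.+?</script>'

# Without DOTALL, "." stops at each newline, so nothing is removed.
without_flag = re.sub(pattern, '', text)

# With DOTALL, "." also matches "\n", so the lazy match can span lines.
with_flag = re.sub(pattern, '', text, flags=re.DOTALL)

print(without_flag == text)
print(repr(with_flag))
```

Note that the newline after </section> sits before the match and is therefore kept; calling .strip() on the result removes it if it is unwanted. The inner <\/script> does not end the match early because the backslash keeps it from matching the literal </script>.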

Upvotes: 3
