Reputation: 69
This usually isn't a hard task, but today I can't seem to remove a simple JavaScript tag.
The example I'm working with (formatted):
<section class="realestate oca"></section>
<script type="text/javascript" data-type="ad">
window.addEventListener('DOMContentLoaded', function(){
window.postscribe && postscribe(document.querySelector(".realestate"),
'<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\/script>');
});
</script>
The example I'm working with (raw):
<section class="realestate oca"></section>\n<script type="text/javascript" data-type="ad">\n\twindow.addEventListener(\'DOMContentLoaded\', function(){\n\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n\t});\n</script>
I would like to remove everything from <script (beginning of the second line) to </script> (last line), so that only the first line, <section..>, is output.
Here's my line of code:
re.sub(r'<script[^</script>]+</script>', '', text)
#or
re.sub(r'<script.+?</script>', '', text)
I'm clearly missing something, but I can't see what.
Note: The document I'm working with contains mainly plain text so no parsing with lxml or similar is needed.
Upvotes: 0
Views: 2647
Reputation: 7840
Your first regex didn't work because a character class ([...]) is a set of individual characters, not a string. So [^</script>]+ will only match if <script is separated from </script> by a run of characters that doesn't include any of <, /, s, c, r, i, p, t or >.
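To see the character-class behaviour in isolation, here is a small sketch. The two input strings are made up for illustration and are not from your document:

```python
import re

pattern = r'<script[^</script>]+</script>'

# The body " 12345" contains none of < / s c r i p t >, so the pattern matches.
removed = re.sub(pattern, '', 'a<script 12345</script>b')

# Here the character right after "<script" is ">", which the negated set
# excludes, so nothing matches and the string comes back unchanged.
unchanged = re.sub(pattern, '', 'a<script>x = 1</script>b')

print(removed)
print(unchanged)
```

Any real script body will contain at least one of those excluded characters, which is why the first pattern never matches your input.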
Your second regex is better; the only reason it isn't working is that, by default, the . wildcard does not match newlines. To make it match them, add the DOTALL flag:
re.sub(r'<script.+?</script>', '', text, flags=re.DOTALL)
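As a quick check, here is a sketch that runs the pattern against the raw example from the question, with and without the flag. The variable names are mine. Note that flags is passed by keyword: the fourth positional argument of re.sub is count, not flags.

```python
import re

# The raw example from the question, rebuilt as a Python string.
text = (
    '<section class="realestate oca"></section>\n'
    '<script type="text/javascript" data-type="ad">\n'
    "\twindow.addEventListener('DOMContentLoaded', function(){\n"
    '\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n'
    '\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/'
    'oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n'
    '\t});\n'
    '</script>'
)

pattern = r'<script.+?</script>'

# Without DOTALL, "." stops at each newline, so the multi-line tag never matches.
without_flag = re.sub(pattern, '', text)

# With DOTALL, "." also matches newlines and the whole block is removed.
with_flag = re.sub(pattern, '', text, flags=re.DOTALL)

print(without_flag == text)
print(repr(with_flag))
```

The inner <\/script> in the example contains a backslash, so the lazy .+? does not stop there and runs on to the real closing tag. The newline after </section> is left behind; call .strip() on the result if you don't want it.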
Upvotes: 3