Reputation: 3430
I am trying to extract JavaScript code from HTML content that I receive via a CFHTTP request.
I have this simple regex that catches everything as long as there is no line break in the code between the tags.
var result=REMatch("<script[^>]*>(.*?)</script>",html);
This will catch:
<script>testtesttest</script>
but not
<script>
testtest
</script>
I have tried to use (?m) for multiline, but it doesn't work like that. I am using the reference to figure it out but I am just not getting it with regex.
Heads up, normally there would be javascript between the script tags, not simple text so also characters like {}();:-_ etc.
Can anyone help me out?
Cheers
[[UPDATE]] Thanks guys, I will try the solutions. I favor regex, but I will look into the HTML parser too.
Upvotes: 2
Views: 5969
Reputation: 112220
(?m) multiline mode is for making ^ and $ match on line breaks (not just start/end of string as is default), but what you're trying to do here is make . include newlines - for that you want (?s) (dot-all mode).
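As a minimal sketch of the dot-all suggestion applied to the regex from the question: the sample html string below is made up for illustration (in practice it would come from the CFHTTP response), and the variable names are hypothetical.

```cfml
<cfscript>
// Hypothetical sample input: one single-line block and one multi-line block.
html = "<script>one</script>" & chr(10)
     & "<script>" & chr(10) & "two" & chr(10) & "</script>";

// (?s) switches on dot-all mode, so .*? can cross the line breaks.
withDotAll = REMatch("(?s)<script[^>]*>(.*?)</script>", html);

// REMatch returns an array of whole matches, one per script block.
writeOutput(arrayLen(withDotAll));
</cfscript>
```

Note that REMatch returns the full matched text including the script tags, not the captured group, so the tags still need stripping afterwards.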
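However, I probably wouldn't do this with regex - a HTML parser is a more robust solution. Here's how to do it with jSoup (using html() rather than text(), because jSoup treats the contents of script tags as data, not text):
var result = jsoup.parse(html).select('script').html();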
More details on using jSoup in CF are available here, or alternatively you can use the TagSoup parser, which ships with CF10 (so you don't need to worry about jars/etc).
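For illustration, a rough sketch of collecting each script block separately with jSoup. It assumes the jSoup jar is already on the ColdFusion classpath; the html sample and the variable names are hypothetical.

```cfml
<cfscript>
// Hypothetical sample input; normally this would be the CFHTTP response body.
html = "<html><head><script>var a = 1;</script></head>"
     & "<body><script>" & chr(10) & "var b = 2;" & chr(10) & "</script></body></html>";

// Assumption: org.jsoup.Jsoup can be loaded (jar on the classpath).
jsoup = createObject("java", "org.jsoup.Jsoup");
elements = jsoup.parse(html).select("script");

// Walk the matched elements and collect the script data of each one.
scripts = [];
it = elements.iterator();
while (it.hasNext()) {
    el = it.next();
    arrayAppend(scripts, trim(el.data()));
}
writeDump(scripts);
</cfscript>
```

Using data() per element keeps the blocks separate, whereas calling a method on the whole selection combines them into one string.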
If you really want regex, then you can use this:
var result = rematch('<script[^>]*>(?:[^<]+|<(?!/script>))+',html);
Unlike using (?s).*? this avoids matching empty blocks (but it will still fail in certain edge cases - if accuracy is required use a HTML parser).
To extract just the text from the first script block, you can strip the script tag with this:
result = ListRest( result[1] , '>' );
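To handle every block rather than just the first, the same two steps can be put in a loop. This is only a sketch: the html sample and the blocks/scripts variable names are made up for illustration.

```cfml
<cfscript>
// Hypothetical sample input with attributes and a multi-line block.
html = '<script type="text/javascript">var a = 1;</script>' & chr(10)
     & '<script>' & chr(10) & 'var b = 2;' & chr(10) & '</script>';

// Each match runs from the opening tag up to (not including) </script>.
blocks = REMatch('<script[^>]*>(?:[^<]+|<(?!/script>))+', html);

// Strip the opening tag from each match and trim surrounding whitespace.
scripts = [];
for (block in blocks) {
    arrayAppend(scripts, trim(ListRest(block, '>')));
}
writeDump(scripts);
</cfscript>
```

ListRest splits on the first > only, so any later > characters inside the JavaScript are kept.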
Upvotes: 8
Reputation: 1550
You can use dot-matches-all mode or replace . with [\s\S] to get the same effect.
<script[^>]*>[\s\S]*?</script> would match everything including newlines.
Upvotes: 0