Mattijs

Reputation: 3430

Finding text between <script></script> tags with regex in ColdFusion, including line breaks

I am trying to extract JavaScript code from HTML content that I receive via a CFHTTP request.

I have this simple regex that catches everything as long as there is no line break in the code between the tags.

var result=REMatch("<script[^>]*>(.*?)</script>",html);

This will catch:

<script>testtesttest</script>

but not

<script>
testtest

</script>

I have tried to use (?m) for multiline, but it doesn't work like that. I am using the reference to figure it out but I am just not getting it with regex.

Heads up: normally there would be JavaScript between the script tags, not simple text, so the content also includes characters like {}();:-_ etc.

Can anyone help me out?

Cheers

[[UPDATE]] Thanks guys, I will try the solutions. I favor regex, but I will look into the HTML parser too.

Upvotes: 2

Views: 5969

Answers (2)

Peter Boughton

Reputation: 112220

(?m) multiline mode is for making ^ and $ match on line breaks (not just start/end of string as is default), but what you're trying to do here is make . include newlines - for that you want (?s) (dot-all mode).
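
To make the difference concrete, here is a minimal cfscript sketch. It assumes a made-up multi-line sample string in a variable named html (not from the original post). The first call uses the pattern from the question and is expected to miss the block; the second prefixes (?s) so that . also matches the line breaks.

```cfml
<cfscript>
// Hypothetical sample input: a script block that spans several lines.
html = "<script>" & chr(10) & "var a = 1;" & chr(10) & "</script>";

// Pattern from the question: . stops at line breaks, so this block should not match.
withoutDotAll = REMatch("<script[^>]*>(.*?)</script>", html);

// Same pattern with (?s): . is allowed to match line breaks as well.
withDotAll = REMatch("(?s)<script[^>]*>(.*?)</script>", html);

// Print how many blocks each pattern found.
writeOutput(arrayLen(withoutDotAll) & " / " & arrayLen(withDotAll));
</cfscript>
```

Note that REMatch returns the whole matched text, including the script tags, rather than the contents of the capture group.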

However, I probably wouldn't do this with regex - a HTML parser is a more robust solution. Here's how to do it with jSoup:

var result = jsoup.parse(html).select('script').html();

More details on using jSoup in CF are available here, or alternatively you can use the TagSoup parser, which ships with CF10 (so you don't need to worry about jars/etc).
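
As a rough sketch of the jSoup route in cfscript: this assumes the jsoup jar is already on ColdFusion's class path, and the sample html string and the variable names are made up for illustration. Jsoup.parse, select, and data are jsoup's own methods; data() returns the raw contents of a script element.

```cfml
<cfscript>
// Assumption: the jsoup jar has been added to ColdFusion's class path.
Jsoup = createObject("java", "org.jsoup.Jsoup");

// Hypothetical sample input with a multi-line script block.
html = "<html><body><script>" & chr(10) & "var a = 1;" & chr(10) & "</script></body></html>";

// Parse the document and select every script element.
scriptElements = Jsoup.parse(html).select("script");

// Collect the raw contents of each script element into a CF array.
scripts = [];
for (i = 0; i < scriptElements.size(); i++) {
    // Java lists are zero-based, and get() expects a Java int.
    arrayAppend(scripts, scriptElements.get(javaCast("int", i)).data());
}
</cfscript>
```

Because the parser handles the markup, line breaks inside the script block need no special treatment here.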


If you really want regex, then you can use this:

var result = rematch('<script[^>]*>(?:[^<]+|<(?!/script>))+',html);

Unlike using (?s).*? this avoids matching empty blocks (but it will still fail in certain edge cases - if accuracy is required use a HTML parser).

To extract just the text from the first script block, you can strip the script tag with this:

result = ListRest( result[1] , '>' );
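
The same stripping step can be applied to every match rather than only the first. A minimal sketch, assuming a made-up sample string in html and an illustrative array named bodies:

```cfml
<cfscript>
// Hypothetical sample input with two script blocks, one of them multi-line.
html = "<script>var a = 1;</script><p>text</p><script>" & chr(10) & "var b = 2;" & chr(10) & "</script>";

// Each match starts at the opening tag and stops before the closing </script>.
blocks = REMatch('<script[^>]*>(?:[^<]+|<(?!/script>))+', html);

// Drop everything up to the first > to leave only the script body.
bodies = [];
for (block in blocks) {
    arrayAppend(bodies, ListRest(block, '>'));
}
</cfscript>
```

ListRest treats > as a list delimiter, so an opening tag whose attribute values themselves contain > would be cut in the wrong place. This is one of the edge cases where a parser is the safer choice.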

Upvotes: 8

pogo

Reputation: 1550

You can use dot matches all mode or replace . with [\s\S] to get the same effect.

<script[^>]*>[\s\S]*?</script> would match everything including newlines.
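
A short cfscript sketch of this variant, assuming a made-up sample string in html. Since REMatch returns each match with its tags included, the sketch strips them in a second step; the variable names are illustrative.

```cfml
<cfscript>
// Hypothetical sample input: a script block that spans several lines.
html = "<script>" & chr(10) & "var a = 1;" & chr(10) & "</script>";

// [\s\S] matches any character, including line breaks, without a mode flag.
blocks = REMatch("<script[^>]*>[\s\S]*?</script>", html);

// Remove the opening and closing tags from each matched block.
bodies = [];
for (block in blocks) {
    body = REReplace(block, "^<script[^>]*>", "");
    body = REReplace(body, "</script>$", "");
    arrayAppend(bodies, body);
}
</cfscript>
```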

Upvotes: 0
