eonj

Reputation: 119

JavaScript regular expression with multiple parentheses

I am trying to write a JavaScript regex that matches only NASM-style comments in HTML. For example, it should match "; interrupt" in "INT 21h ; interrupt".

You may know /;.*/ can't be the answer, because an HTML entity (which also ends in ";") can appear before the comment. I thought /(?:[^&]|&.+;)*(;.*)$/ should work, but I found it has two problems:

  1. "      ; hello world".match(/(?:[^&]|&.+;)*(;.*)$/) is an array ["      ; hello world", "; hello world"]. I don't want an array.
  2. "      ; hello world; a message".match(/(?:[^&]|&.+;)*(;.*)$/) is ["      ; hello world; a message", "; a message"]; the second element is even worse.

Questions:

  1. Why is the (?:) block returned?
  2. Why is the second element "; a message", not "; hello world; a message"?
  3. What's the right regex I can use?

Upvotes: 1

Views: 184

Answers (2)

collapsar

Reputation: 17238

ad 1.) the (?:) block is not returned. instead, the complete match is returned in the first array element, followed by one element per capturing group. this behavior follows the specification for non-global matching (i.e. without the g flag).
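A minimal sketch of the difference, using the pattern and test string from the question (the variable names are only illustrative):

```javascript
const line = "      ; hello world";

// Non-global match: element 0 is the whole match, element 1 is capture group 1.
const result = line.match(/(?:[^&]|&.+;)*(;.*)$/);
const comment = result ? result[1] : null; // match() yields null when nothing matches

// Global match: only whole matches are listed, capture groups are not included.
const globalResult = line.match(/(?:[^&]|&.+;)*(;.*)$/g);

console.log(result.length, comment);
console.log(globalResult.length, globalResult[0] === line);
```

So the comment itself is read from index 1 of the non-global result rather than from the result as a whole.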

ad 2.) the first part of your regex, (?:[^&]|&.+;)*, matches too much. in fact it would match the complete line if you dropped the second portion. in plain english, you asked the engine to match either an & followed by as many characters as possible followed by a ;, or any single character other than &, and to repeat this as often as possible. since [^&] also matches ;, the repetition runs up to the last ; in the test string (if there is one), leaving only the final ;-initial suffix for the capturing group.
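This can be checked directly on the second test string from the question. The sketch below only uses the original pattern; the variable names are illustrative:

```javascript
const line = "      ; hello world; a message";

// The first half on its own already consumes the complete line.
const prefixOnly = line.match(/(?:[^&]|&.+;)*/)[0];

// With the second half attached, the engine backtracks only as far as
// the last ";" so that (;.*)$ can match.
const full = line.match(/(?:[^&]|&.+;)*(;.*)$/);
const fromLastSemicolon = line.slice(line.lastIndexOf(";"));

console.log(prefixOnly === line);
console.log(full[1] === fromLastSemicolon);
```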

ad 3.) try

(?:[^&;]*(&[a-zA-Z0-9_-]+;[^&;]*)*)(;.*)$

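it fixes the broken entity matching and captures the longest ;-initial suffix. note that the entity sub-pattern is itself a capturing group, so the comment ends up in the last element of the result (index 2).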
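One possible way to wrap this in a function is sketched below. It is a variation rather than the exact pattern above: the name extractComment is hypothetical, a ^ anchor is added so the match always starts at the beginning of the line, and the entity sub-pattern is made non-capturing so that the comment is group 1.

```javascript
// Hypothetical helper; the name and the two pattern tweaks are assumptions.
function extractComment(line) {
  // ^ anchors at the start of the line; (?:...) keeps the entity part
  // out of the result, so the comment is the only capture group.
  const pattern = /^(?:[^&;]*(?:&[a-zA-Z0-9_-]+;[^&;]*)*)(;.*)$/;
  const result = line.match(pattern);
  return result ? result[1] : null;
}

console.log(extractComment("INT 21h ; interrupt"));
console.log(extractComment("      ; hello world; a message"));
console.log(extractComment("CMP AX, 1 &lt; 2 ; compare"));
console.log(extractComment("MOV AX, BX"));
```

The entity character class does not include "#", so numeric references such as &#39; are not recognized as entities by this sketch.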

tested with pagecolumn regex tester (i'm not affiliated with this website).

Upvotes: 0

Tom Lord

Reputation: 28285

1) The (?:) is not being returned. What you are seeing is that a successful non-global .match() returns an array (and null when nothing matches): the first element is the whole match, and the following elements (if any) are the captured groups. In this case, you have one capturing group, so the array contains two items.

2) Because of the first half of your regex:

(?:[^&]|&.+;)*

This is not a good idea! It will match just about anything, including newlines, because [^&] matches every character except "&". In fact, the only thing it won't match is a "&" that is not followed by a ";" later on the same line. Thus, it consumes everything up to the last ";" in each of your lines, leaving only that final ";" and whatever follows it for the capturing group.
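A quick way to see both effects is to run the first half on its own. The test strings below are made up for illustration:

```javascript
const prefix = /(?:[^&]|&.+;)*/;

// [^&] also matches "\n", so the match runs across the line break.
const twoLines = "MOV AX, 1\nINT 21h";
const acrossLines = twoLines.match(prefix)[0];

// A "&" with no ";" after it on the same line stops the match.
const loneAmpersand = "AX & BX".match(prefix)[0];

console.log(acrossLines === twoLines);
console.log(JSON.stringify(loneAmpersand));
```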

3) I'm not at all familiar with NASM-style comments in HTML, so I'd need to see a more extensive list of what you want matched/not matched in order to confidently give a good answer here.

But here's something I've thrown together very quickly. Note that, as written, it only matches lines in which an entity and a single whitespace character come immediately before the comment:

.*&.*?;\s(;.*)$
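As a rough check of what this pattern accepts, here is a small sketch. The first test string is invented for illustration; the second is the example from the question:

```javascript
const pattern = /.*&.*?;\s(;.*)$/;

// An entity, one whitespace character, then the comment: group 1 is the comment.
const withEntity = "CMP AX, 1 &lt; ; compare".match(pattern);

// The pattern requires a literal "&", so a line without any entity does not match.
const withoutEntity = "INT 21h ; interrupt".match(pattern);

console.log(withEntity[1]);
console.log(withoutEntity);
```

A line without entities would therefore need a separate pattern or an optional entity part.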

Upvotes: 1
