Reputation: 119
I am trying to write a JavaScript regex that matches only NASM-style comments in HTML. For example, matching "; interrupt" for "INT 21h ; interrupt".
You may know /;.*/ can't be the answer because there can be an HTML entity before the comment; I thought /(?:[^&]|&.+;)*(;.*)$/ should work for it, but I found it has two problems:
1. " ; hello world".match(/(?:[^&]|&.+;)*(;.*)$/) is an array [" ; hello world", "; hello world"]. I don't want an array.
2. " ; hello world; a message".match(/(?:[^&]|&.+;)*(;.*)$/) is [" ; hello world; a message", "; a message"]; the second element is even worse.
Question:
1. Why is the (?:) block returned?
2. Why is the second element "; a message", not "; hello world; a message"?
Upvotes: 1
Views: 184
Reputation: 17238
ad 1.) The ?: block is not returned. Instead, the complete match is returned in the first array element, followed by the contents of each capture group. This behavior follows the specification for non-global matching (i.e. without the g option).
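As a rough illustration of the non-global behavior described here, the sketch below runs the asker's pattern with and without the g flag. The sample line starting with &nbsp; is my assumption; any entity before the comment should behave the same way.

```javascript
// Assumed sample line: an HTML entity followed by a NASM-style comment.
const line = "&nbsp; ; hello world";
const pattern = /(?:[^&]|&.+;)*(;.*)$/;

// Without the g flag, match() returns the whole match at index 0,
// followed by one entry per capture group.
const nonGlobal = line.match(pattern);
console.log(nonGlobal[0]); // "&nbsp; ; hello world" (whole match)
console.log(nonGlobal[1]); // "; hello world" (the (;.*) group)

// With the g flag, match() returns only the whole matches, without groups.
const globalResult = line.match(new RegExp(pattern.source, "g"));
console.log(globalResult.length); // 1
```

So the first element is the overall match, not the contents of the (?:) group.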
ad 2.) The first part of your regex, (?:[^&]|&.+;)*, matches too much. In fact, it would match the complete line if you dropped the second portion. In plain English, you asked it to match either an & followed by as many characters as possible followed by a ;, or any symbol other than &, and to repeat this match as often as possible, up to the last ; in the test string (if there is one).
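A small sketch of that greediness, assuming a sample line that starts with &nbsp; (the exact entity is my assumption):

```javascript
// Assumed sample line with an entity and two semicolons in the comment.
const sample = "&nbsp; ; hello world; a message";

// &.+; on its own: .+ is greedy, so it runs up to the LAST semicolon.
const greedyEntity = sample.match(/&.+;/)[0];

// In the full pattern the greedy repetition likewise swallows everything
// up to the last semicolon, leaving only the final ";..." for the group.
const captured = sample.match(/(?:[^&]|&.+;)*(;.*)$/)[1];

console.log(greedyEntity); // "&nbsp; ; hello world;"
console.log(captured);     // "; a message"
```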
ad 3.) Try
(?:[^&;]*(&[a-zA-Z0-9_-]+;[^&;]*)*)(;.*)$
It fixes the broken entity matching and returns the longest ;-initial suffix. Note that the inner (&...) group is capturing, so the suffix ends up in the second capture group.
Tested with the pagecolumn regex tester (I'm not affiliated with this website).
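Here is a minimal sketch that exercises the suggested pattern. The sample lines, the helper name commentOf, and the anchored variant are my own assumptions, not part of the answer. Without a leading ^, the engine may start matching in the middle of an entity, so a line consisting only of "&nbsp;" yields ";" as a "comment"; anchoring the pattern seems to avoid that. Numeric entities such as &#160; are not covered by the character class either way.

```javascript
// Pattern from the answer, unchanged. The inner (&...;...) group is
// capturing, so the comment lands in group 2.
const suggested = /(?:[^&;]*(&[a-zA-Z0-9_-]+;[^&;]*)*)(;.*)$/;

// Hypothetical variant: the same pattern anchored at the start of the line.
const anchored = /^(?:[^&;]*(&[a-zA-Z0-9_-]+;[^&;]*)*)(;.*)$/;

// Hypothetical helper: returns the comment (group 2) or null.
function commentOf(re, text) {
  const m = text.match(re);
  return m ? m[2] : null;
}

console.log(commentOf(suggested, "INT 21h ; interrupt"));             // "; interrupt"
console.log(commentOf(suggested, "&nbsp; ; hello world; a message")); // "; hello world; a message"
console.log(commentOf(suggested, "&nbsp;"));                          // ";" (false positive)
console.log(commentOf(anchored, "&nbsp;"));                           // null
```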
Upvotes: 0
Reputation: 28285
1) The (?:) is not being returned. What you are seeing is that, for a non-global regex, the .match() method returns an array whenever it finds a match (and null otherwise): the first element is the whole match, and the following elements (if any) are the capture groups. In this case, you have one capture group, so the array contains two items.
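If only the comment text is wanted rather than the array, one option is to index into the result. The helper below is a hypothetical sketch (the name firstCapture and the simple /(;.*)$/ pattern are assumptions for illustration); it also guards against the null that .match() returns when nothing matches.

```javascript
// Hypothetical helper: returns only the first capture group, or null when
// the pattern does not match (match() returns null then, not an array).
function firstCapture(text, re) {
  const m = text.match(re);
  return m === null ? null : m[1];
}

console.log(firstCapture("INT 21h ; interrupt", /(;.*)$/)); // "; interrupt"
console.log(firstCapture("INT 21h", /(;.*)$/));             // null
```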
2) Because of the first half of your regex:
(?:[^&]|&.+;)*
This is not a good idea! This will match just about anything, including newlines! In fact, the only thing it won't match is a "&" that is not followed by a ";" on the same line. Thus, it matches everything up to the last ";" in each of your lines.
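A quick check of the newline claim, using a made-up two-line input (the sample text is an assumption):

```javascript
// [^&] has no special treatment for line breaks, so the prefix can span lines.
const twoLines = "MOV AX, 1\nINT 21h ; interrupt";
const wholeMatch = twoLines.match(/(?:[^&]|&.+;)*(;.*)$/)[0];

console.log(wholeMatch.includes("\n")); // true
```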
3) I'm not at all familiar with NASM-style comments in HTML, so I'd need to see a more extensive list of what you want matched/not matched in order to confidently give a good answer here.
But here's something I've thrown together very quickly, to at least solve the two examples you gave above:
.*&.*?;\s(;.*)$
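A minimal sketch trying this pattern on the two examples, assuming both lines start with &nbsp; (my assumption about the original input). The helper name quickComment is hypothetical. Since the pattern requires an "&", a line without any entity, such as "INT 21h ; interrupt", does not match at all.

```javascript
// Pattern from the answer, unchanged; the comment is in group 1.
const quick = /.*&.*?;\s(;.*)$/;

// Hypothetical helper: returns the comment or null when there is no match.
function quickComment(text) {
  const m = text.match(quick);
  return m ? m[1] : null;
}

console.log(quickComment("&nbsp; ; hello world"));            // "; hello world"
console.log(quickComment("&nbsp; ; hello world; a message")); // "; hello world; a message"
console.log(quickComment("INT 21h ; interrupt"));             // null (no entity present)
```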
Upvotes: 1