Reputation: 6138
I want to parse strings using javascript with two alternative formats:
id#state#{font name, font size, "text"}
// e.g. button1#hover#{arial.ttf, 20, "Ok"}
or
id#state#text
// e.g. button1#hover#Ok
where, in the second variant, a default font and size are assumed.
Before you read further, I have to point out that I control the format, so I'd love to hear about any other format that is more RegExp Friendly™. That being said, the second alternative is needed for historical reasons, as is the id#state# part. In other words, the flexibility resides in the {font name, font size, "text"} part.
Furthermore, I'd like to use RegExp as far as possible. Yes, the RegExp I suggest below is pretty hairy, but for my case this is not only a possible solution to the problem at hand but also a matter of learning more about RegExp itself.
My current attempt at grouping the three or alternatively five information elements in the two formats is as follows.
var pat = /^(\w*)#(\w*)#(?:(?:\{([\w\.]*),\s*([0-9\.]*),\s*"([\w\s]*)"\})|([\w\s]*))$/;
var source1 = "button1#hover#{arial.ttf, 20, \"Ok\"}";
var source2 = "button1#hover#Ok";
var result1 = source1.match ( pat );
var result2 = source2.match ( pat );
alert ( "Source1: " + result1.length + " Source2: " + result2.length );
When I tested this expression at http://www.regular-expressions.info/javascriptexample.html, I got:
result1 = [ button1#hover#{arial.ttf, 20, "Ok"}, button1, hover, arial.ttf,
20, Ok, undefined ]
and
result2 = [ button1#hover#Ok, button1, hover, undefined,
undefined, undefined, Ok ]
Here's how I break down the RegExp:
^(\w*)#(\w*)#(?:(?:\{([\w\.]*),\s*([0-9\.]*),\s*"([\w\s]*)"\})|([\w\s]*))$
^ # anchor to beginning of string
(\w*) # capture required id
# # match hash sign separator
(\w*) # capture required state
# # match hash sign separator
# capture text structure with optional part:
(?:(?:\{([\w\.]*),\s*([0-9\.]*),\s*"([\w\s]*)"\})|([\w\s]*))
$ # anchor to end of string
The text structure capture is the dodgiest part, I guess. I break it down as follows:
(?: # match all of what follows but don't capture
(?:\{ # match left curly bracket but don't capture (non-capturing group)
([\w\.]*) # capture font name (with possible punctuation in font file name)
,\s* # match comma and zero or more whitespaces
([0-9\.]*) # capture font size (with possible decimal part)
,\s*" # match comma, zero or more whitespaces, and a quotation char
([\w\s]*) # capture text including whitespaces
"\}) # match quotation char and right curly bracket (and close non-capturing group)
| # alternation operator
([\w\s]*) # capture optional group to match the second format variant
) # close outer non-capturing group
My question is twofold:
1) How can I avoid the trailing undefined match in the result1 case?
2) How can I avoid the three undefined matches in the middle of the result2 case?
Bonus question:
Did I get the breakdown right? (I guess there is something amiss, since the RegExp isn't working entirely as expected.)
Thanks! :)
Upvotes: 3
Views: 86
Reputation: 413757
The groups in your regex are numbered from left to right without regard for the operators (in particular, the | operator). When you've got (x)|(y), then the group for either "x" or "y" will be undefined.
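To make that concrete, here is a stripped-down illustration of the (x)|(y) case. The pattern and the variable names (re, mx, my) are made up for this example and are not part of your code:

```javascript
// Two capturing groups separated by alternation, anchored on both sides.
var re = /^(?:(x)|(y))$/;

// Only the first alternative takes part in the match,
// so group 2 is reported as undefined.
var mx = "x".match(re);

// Only the second alternative takes part in the match,
// so group 1 is reported as undefined.
var my = "y".match(re);

// Both results have the same length: the full match plus one slot per group.
console.log(mx.length, my.length);
```

Either way the result array has one slot per capturing group in the pattern, whether or not that group took part in the match.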
Thus you can't avoid the empty slots in the result. In fact, I think you want them, because otherwise you don't really know which form of input you've matched.
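One way to use the empty slots is to check the last group and normalize both forms into the same object. This is only a rough sketch: parseLabel, DEFAULT_FONT and DEFAULT_SIZE are names and values I made up, since the question doesn't say what the defaults are.

```javascript
// Same pattern as in the question, written on a single line.
var pat = /^(\w*)#(\w*)#(?:(?:\{([\w\.]*),\s*([0-9\.]*),\s*"([\w\s]*)"\})|([\w\s]*))$/;

// Assumed defaults for the short form; replace with your real ones.
var DEFAULT_FONT = "default.ttf";
var DEFAULT_SIZE = 12;

// Hypothetical helper: returns null if neither format matches.
function parseLabel(source) {
  var m = source.match(pat);
  if (m === null) {
    return null;
  }
  // Group 6 is defined only when the short form (id#state#text) matched.
  var isShort = m[6] !== undefined;
  return {
    id: m[1],
    state: m[2],
    font: isShort ? DEFAULT_FONT : m[3],
    size: isShort ? DEFAULT_SIZE : parseFloat(m[4]),
    text: isShort ? m[6] : m[5]
  };
}

var long1 = parseLabel("button1#hover#{arial.ttf, 20, \"Ok\"}");
var short1 = parseLabel("button1#hover#Ok");
```

With this, the caller never sees the undefined slots; they are only used internally to tell the two forms apart.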
Upvotes: 2