conciliator

Reputation: 6138

Trying to understand javascript regexp result

I want to parse strings in JavaScript that come in one of two alternative formats:

id#state#{font name, font size, "text"}  
// e.g. button1#hover#{arial.ttf, 20, "Ok"}

or

id#state#text                            
// e.g. button1#hover#Ok

where, in the second variant, a default font and size are assumed.

Before you read further, I have to point out that I control the format, so I'd love to hear about any other format that is more RegExp Friendly™. That being said, the second alternative is needed for historical reasons, as is the id#state#-part. In other words, the flexibility resides in the {font name, font size, "text"}-part.

Furthermore, I'd like to use RegExp as far as possible. Yes, the RegExp I suggest below is pretty hairy, but for my case this is not only a possible solution to the problem at hand but also a matter of learning more about RegExp itself.

My current attempt at grouping the three or alternatively five information elements in the two formats is as follows.

var pat = /^(\w*)#(\w*)#(?:(?:\{([\w\.]*),\s*([0-9\.]*),\s*"([\w\s]*)"\})|([\w\s]*))$/;

var source1 = "button1#hover#{arial.ttf, 20, \"Ok\"}";
var source2 = "button1#hover#Ok";

var result1 = source1.match ( pat );
var result2 = source2.match ( pat );

alert ( "Source1: " + result1.length + " Source2: " + result2.length );

When I tested this expression at http://www.regular-expressions.info/javascriptexample.html, I got:

result1 = [ button1#hover#{arial.ttf, 20, "Ok"}, button1, hover, arial.ttf, 
            20, Ok, undefined ]

and

result2 = [ button1#hover#Ok, button1, hover, undefined, 
            undefined, undefined, Ok ]

Here's how I break down the RegExp:

^(\w*)#(\w*)#(?:(?:\{([\w\.]*),\s*([0-9\.]*),\s*"([\w\s]*)"\})|([\w\s]*))$

^                 # anchor to beginning of string
(\w*)             # capture required id
#                 # match hash sign separator
(\w*)             # capture required state
#                 # match hash sign separator
                  # capture text structure with optional part:
(?:(?:\{([\w\.]*),\s*([0-9\.]*),\s*"([\w\s]*)"\})|([\w\s]*))  
$                 # anchor to end of string

The text structure capture is the dodgiest part, I guess. I break it down as follows:

(?:                  # match all of what follows but don't capture
    (?:\{            # open inner non-capturing group and match a left curly bracket
          ([\w\.]*)  # capture font name (with possible punctuation in font file name)
          ,\s*       # match comma and zero or more whitespaces
          ([0-9\.]*) # capture font size (with possible decimal part)
          ,\s*"      # match comma, zero or more whitespaces, and a quotation char
          ([\w\s]*)  # capture text including whitespaces
    "\})             # match quotation char and right curly bracket (and close non-capturing group)
    |                # alternation operator
    ([\w\s]*)        # capture optional group to match the second format variant
)                    # close outer non-capturing group

My question is twofold:

1) How can I avoid the trailing undefined match in the result1 case?

2) How can I avoid the three undefined matches in the middle of the result2 case?

Bonus question:

Did I get the breakdown right? (I guess there is something amiss, since the RegExp isn't working entirely as expected.)

Thanks! :)

Upvotes: 3

Views: 86

Answers (1)

Pointy

Reputation: 413757

The groups in your regex are numbered from left to right by their opening parentheses, without regard for the operators (in particular, the | operator). When you've got (x)|(y), both groups still get a slot in the result, and the group on the alternative that didn't match will be undefined.
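
To see the numbering in isolation, here is a minimal sketch with a stripped-down pattern. The names demo, mx and my are made up for illustration and are not part of your code.

```javascript
// Groups are numbered by the position of their opening parenthesis,
// so both slots always exist in the result array. The group on the
// alternative that did not take part in the match is undefined.
var demo = /^(?:(x)|(y))$/;

var mx = "x".match(demo);  // group 1 took part, group 2 did not
var my = "y".match(demo);  // group 2 took part, group 1 did not

// Both results have the same length: full match plus two group slots.
console.log(mx.length, mx[1], mx[2]);
console.log(my.length, my[1], my[2]);
```

The array length is the same for both inputs; only the position of the undefined entry changes.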

Thus you can't avoid the empty slots in the result. In fact, I think you want them, because otherwise you don't really know which form of input you've matched.
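
If the holes bother you, one option is to leave the regex alone and normalize the result afterwards. The following is only a sketch: parseLabel, DEFAULT_FONT and DEFAULT_SIZE are hypothetical names, and the default values are placeholders I made up, since your question doesn't say what the defaults are.

```javascript
// Same pattern as in the question, written on one line.
var PAT = /^(\w*)#(\w*)#(?:\{([\w.]*),\s*([0-9.]*),\s*"([\w\s]*)"\}|([\w\s]*))$/;

// Assumed placeholders; substitute whatever your defaults really are.
var DEFAULT_FONT = "default.ttf";
var DEFAULT_SIZE = 12;

// Hypothetical helper: returns a plain object, or null if nothing matched.
function parseLabel(source) {
  var m = source.match(PAT);
  if (m === null) {
    return null;
  }
  // Group 6 is defined only when the plain-text alternative matched,
  // so it tells us which of the two formats we are looking at.
  var isShort = m[6] !== undefined;
  return {
    id: m[1],
    state: m[2],
    font: isShort ? DEFAULT_FONT : m[3],
    size: isShort ? DEFAULT_SIZE : parseFloat(m[4]),
    text: isShort ? m[6] : m[5]
  };
}

var parsed1 = parseLabel("button1#hover#{arial.ttf, 20, \"Ok\"}");
var parsed2 = parseLabel("button1#hover#Ok");
```

Both calls return an object with the same five properties, so the rest of the code no longer has to care which group index the text ended up in.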

Upvotes: 2
