markE

Reputation: 105015

Use RegEx to parse a string with complicated delimiting

This is a RegEx question.

Thanks for any help, and please be patient, as RegEx is definitely not my strength!

Purely as background: I want to use RegEx to parse strings similar to SVG path data segments. I’ve looked for previous answers that parse both the segments and their segment attributes, but found nothing that does the latter properly.

Here are some example strings like the ones I need to parse:

M-11.11,-22
L.33-44  
ac55         66 
h77  
M88 .99  
Z 

I need to have the strings parsed into arrays like this:

["M", -11.11, -22]
["L", .33, -44]
["ac", 55, 66]
["h", 77]
["M", 88, .99]
["Z"]

So far I have found this code in this answer: Parsing SVG "path" elements with C# - are there libraries out there to do this? The post is about C#, but the regex was useful in JavaScript:

var argsRX = /[\s,]|(?=-)/; 
var args = segment.split(argsRX);

Here's what I get:

 [ "M", -11.11, -22, <empty element>  ]
 [ "L.33", -44, <empty>, <empty> ]
 [ "ac55", <empty>, <empty>, <empty>, 66, <empty> ]
 [ "h77", <empty>, <empty> ]
 [ "M88", .99, <empty>, <empty> ]
 [ "Z", <empty> ]

Problems when using this regex:

Here are more complete definitions of incoming strings:

Here is test code I've been using:

<!doctype html>
<html>
<head>
<link rel="stylesheet" type="text/css" media="all" href="css/reset.css" /> <!-- reset css -->
<script type="text/javascript" src="http://code.jquery.com/jquery.min.js"></script>

<style>
    body{ background-color: ivory; }
</style>

<script>
    $(function(){


var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";

// separate pathData into segments
var segmentRX = /[a-z]+[^a-z]*/ig;
var segments = pathData.match(segmentRX);

for(var i=0;i<segments.length;i++){
    var segment=segments[i];
    //console.log(segment);

    var argsRX = /[\s,]|(?=-)/; 
    var args = segment.split(argsRX);
    for(var j=0;j<args.length;j++){
        var arg=args[j];
        console.log(arg.length+": "+arg);
    }

}

    }); // end $(function(){});
</script>

</head>

<body>
</body>
</html>

Upvotes: 2

Views: 1023

Answers (5)

Tomalak

Reputation: 338178

^([a-z]+)(?:(-?\d*\.?\d+)[^\d\n\r.-]*(-?\d*\.?\d+)?)?

Explanation

^                # start of string
([a-z]+)         # one or more letters, match into group 1
(?:              # non-capturing group
  (-?\d*\.?\d+)  #   first number (optional sign & decimal point, digits)
  [^\d\n\r.-]*   #   delimiting characters (anything but these)
  (-?\d*\.?\d+)? #   second number (optional)
)?               # end non-capturing group, make optional

Use with "case insensitive" flag.
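
A minimal sketch of how the pattern might be applied to one segment at a time, assuming the decimal point is meant literally (`\.`) and that the path data has already been split into segments as in the question. The helper name `parseSegment` is hypothetical.

```javascript
// The pattern from above, with the "i" flag and a literal decimal point
var segmentPattern = /^([a-z]+)(?:(-?\d*\.?\d+)[^\d\n\r.-]*(-?\d*\.?\d+)?)?/i;

// Hypothetical helper: returns [command, number, number] with absent numbers left out
function parseSegment(segment) {
    var m = segmentPattern.exec(segment);
    if (!m) return null;                        // no leading command letters
    var result = [m[1]];
    if (m[2] !== undefined) result.push(m[2]);  // first number, if present
    if (m[3] !== undefined) result.push(m[3]);  // second number, if present
    return result;
}

console.log(parseSegment("M-11.11,-22")); // ["M", "-11.11", "-22"]
console.log(parseSegment("h77  "));       // ["h", "77"]
```

The numbers come back as strings, so a conversion step would still be needed if numeric values are wanted.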

Upvotes: 3

Niet the Dark Absol

Reputation: 324630

Your "pattern" consists of one or more letters, followed by a decimal number, followed by a second decimal number that is separated from the first by commas or whitespace, or simply by its own minus sign.

Regex: /([a-z]+)(-?(?:\d*\.)?\d+)(?:[,\s]+|(?=-))(-?(?:\d*\.)?\d+)/i
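
As a rough illustration, this is how the capture groups could be read in JavaScript. The helper name `matchPair` is made up for this sketch. Because the pattern as written expects two numbers, a segment such as `h77` or `Z` yields no match here and would need separate handling.

```javascript
var pairPattern = /([a-z]+)(-?(?:\d*\.)?\d+)(?:[,\s]+|(?=-))(-?(?:\d*\.)?\d+)/i;

// Hypothetical helper: returns [command, first, second] or null when there is no match
function matchPair(segment) {
    var m = pairPattern.exec(segment);
    return m ? [m[1], m[2], m[3]] : null;
}

console.log(matchPair("L.33-44  "));  // ["L", ".33", "-44"]
console.log(matchPair("h77  "));      // null (only one number)
```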

Upvotes: 2

Joseph Myers

Reputation: 6552

I had to perform very similar parsing of data for reporting live results at the nation's largest track meet. http://ksathletics.com/2013/statetf/liveresults.js Although there was a lot of both client and server-side code involved, the principles are the same. In fact, the kind of data was practically identical.

I suggest that you do not use one "jumbo" regular expression, but rather one expression which separates data pieces and another which breaks each data piece into its main identifier and the following values. This solves the problem of various delimiters by allowing the second-level regular expression to match the definition of data values rather than having to distinguish delimiters. (This also is more efficient than putting all of the logic into a single regular expression.)

This is a solution tested to work on the input you gave.

<script>
var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";

function parseData(pathData) {
    /* first level: one piece per command (letters plus the values that follow) */
    var pieces = pathData.match(/([a-z]+[-.,\d ]*)/gi), i;
    /* second level: parse each piece into its own array of command and values */
    for (i=0; i<pieces.length; i++)
        pieces[i] = pieces[i].match(/([a-z]+|-?[.\d]*\d)/gi);
    return pieces;
}

var pathPieces = parseData(pathData);
document.write(pathPieces.join('<br />'));
console.log(pathPieces);
</script>

http://dropoff.us/private/1370846040-1-test-path-data.html

Update: The results are exactly equivalent to the specified output you want. One thought that came to mind, however, was whether you also want or need type conversion from strings to numbers. Do you need that as well? I'm just thinking of the next step beyond parsing the data.
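
If numeric values are needed, one possible follow-up step is to convert every element after the command letters with `Number`. This is only a sketch: the helper name `toNumbers` is hypothetical, and it assumes each piece has the shape `[command, value, value, ...]` with the values as strings.

```javascript
// Hypothetical helper: keeps element 0 (the command) and converts the rest to numbers
function toNumbers(pieces) {
    return pieces.map(function (piece) {
        return piece.map(function (value, index) {
            return index === 0 ? value : Number(value);
        });
    });
}

var converted = toNumbers([["M", "-11.11", "-22"], ["L", ".33", "-44"], ["Z"]]);
console.log(converted); // [["M", -11.11, -22], ["L", 0.33, -44], ["Z"]]
```

`Number` returns `NaN` for anything that is not a valid numeric string, which makes malformed values easy to spot afterwards.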

Upvotes: 4

Markus Jarderot

Reputation: 89171

function parsePathData(pathData)
{
    var tokenizer = /([a-z]+)|([+-]?(?:\d+\.?\d*|\.\d+))/gi,
        match,
        current,
        commands = [];

    tokenizer.lastIndex = 0;
    while (match = tokenizer.exec(pathData))
    {
        if (match[1])
        {
            if (current) commands.push(current);
            current = [ match[1] ];
        }
        else
        {
            if (!current) current = [];
            current.push(match[2]);
        }
    }
    if (current) commands.push(current);
    return commands;
}

var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";
var commands = parsePathData(pathData);
console.log(commands);

Output:

[ [ "M", "-11.11", "-22" ],
  [ "L", ".33", "-44" ],
  [ "ac", "55", "66" ],
  [ "h", "77" ],
  [ "M", "88", ".99" ],
  [ "Z" ] ]

Upvotes: 2

Casimir et Hippolyte

Reputation: 89557

You can try with this pattern:

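/([a-z]+)(-?(?:\d*\.)?\d+)?(?:\s+|,|(-(?:\d*\.)?\d+))?(-?(?:\d*\.)?\d+)?/i

(a bit long, but it seems to work)

Note that the second number can end up in capture group 3 or in capture group 4: group 3 when it follows the first number directly with only its minus sign as the separator, group 4 otherwise.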
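
One way to cope with the two possible positions of the second number, sketched under the assumption that the pattern is used case-insensitively and on one segment at a time. The helper name `readSegment` is invented for the example.

```javascript
var commandPattern = /([a-z]+)(-?(?:\d*\.)?\d+)?(?:\s+|,|(-(?:\d*\.)?\d+))?(-?(?:\d*\.)?\d+)?/i;

// Hypothetical helper: collects the command and up to two numbers from one segment
function readSegment(segment) {
    var m = commandPattern.exec(segment);
    if (!m) return null;
    var result = [m[1]];
    if (m[2] !== undefined) result.push(m[2]);
    // for a two-number segment, the second number is in group 3 or in group 4
    var second = m[3] !== undefined ? m[3] : m[4];
    if (second !== undefined) result.push(second);
    return result;
}

console.log(readSegment("L.33-44  "));   // ["L", ".33", "-44"]  (second number from group 3)
console.log(readSegment("M-11.11,-22")); // ["M", "-11.11", "-22"]  (second number from group 4)
```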

Upvotes: 1
