Morphine

Reputation: 178

How to match and parse specific info in tags with varying form

I need to parse and process input data pushed into our webservice as (UTF-8) text files from a 3rd party. The input files contain tags in the general form

<{ _N ('some_domain_id','this can be an arbitary string',{'a':'b','c':'d'}) }>
-- --   --------------   ------------------------------  -----------------  --
        ^                 ^                              ^
        i need to         |                              this part is 
        extract this      and this (payload)             optional

These tags can appear anywhere in the text file; no assumptions can be made about their distribution or about what lies between them. The tokens <{, _N and }> are present in every valid tag, but there may be whitespace between them that does not affect the values (e.g. between <{ and _N). With only that information, my initial test set was limited, and my current implementation is a regex combined with a split of the result at the comma (,).

The service has been up for a while now, and I have come to realize that although the vast majority of tags are caught correctly, a small number of tags contain anomalies and are not recognized by this implementation.

Caveats (things that can happen in the payload part):

Here are some sample tags I identified that cannot be parsed or that produce wrong results (which is even worse than not being recognized).

<{_N( 'some_domain_id' , 'this can ( be an arbitary ) string',{'a':'b','c':'d'})}>

- Not recognized. Note the parentheses: they can occur anywhere in the data string and do not even need to be balanced (example: https://regex101.com/r/BCiaaj/1)

<{_N( 'some_domain_id' , 'this can  be an arbitary  string' {'a':'b','c':'d'})|e('modifier')}>
<{_N( 'some_domain_id' , 'this can  be an arbitary string')|e('modifier')}>

Experimentally, I have already tried a few other regexes, but they always fail on one or more of the tags in my extended test set, e.g.

So my questions, as I am not a real regex expert:

Any help is highly appreciated!

Upvotes: 1

Views: 77

Answers (2)

Wiktor Stribiżew

Reputation: 626896

You may use

<{\s*_N\s*\(\s*'([^\\']*(?:\\.[^\\']*)*)'\s*,\s*'([^\\']*(?:\\.[^\\']*)*)'\s*(.*?)}>

See the regex demo

Details:

  • <{\s* - a <{ plus 0+ whitespaces
  • _N - tag start
  • \s*\(\s* - a ( enclosed with 0+ whitespaces
  • '([^\\']*(?:\\.[^\\']*)*)' - a single quoted string literal that may contain escaped single quotes and other chars (the inside contents are captured into a capturing group #1)
  • \s*,\s* - a , enclosed with 0+ whitespaces
  • '([^\\']*(?:\\.[^\\']*)*)' - a single quoted string literal that may contain escaped single quotes and other chars (the inside contents are captured into a capturing group #2)
  • \s* - 0+ whitespaces
  • (.*?) - any 0+ chars other than line break chars, as few as possible (captured into group #3), up to the first
  • }> - literal char sequence }>.
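
As a minimal sketch of how the pattern above might be applied in PHP (the language used elsewhere in this thread), the snippet below runs it over a text blob and collects the three capturing groups. The helper name extract_tags() and the array keys domain, payload and rest are assumptions made for illustration; they are not part of the original answer. The pattern is kept in a nowdoc so that it can be pasted verbatim without extra PHP string escaping.

```php
<?php
// Sketch only: extract_tags() is a hypothetical helper name.
function extract_tags(string $text): array
{
    // Nowdoc keeps the regex verbatim; ~ is used as the delimiter.
    $re = <<<'RE'
~<{\s*_N\s*\(\s*'([^\\']*(?:\\.[^\\']*)*)'\s*,\s*'([^\\']*(?:\\.[^\\']*)*)'\s*(.*?)}>~
RE;
    // PREG_SET_ORDER yields one array per matched tag.
    preg_match_all($re, $text, $matches, PREG_SET_ORDER);

    $tags = [];
    foreach ($matches as $m) {
        $tags[] = [
            'domain'  => $m[1], // group 1: first quoted string
            'payload' => $m[2], // group 2: second quoted string
            'rest'    => $m[3], // group 3: raw remainder up to the first }>
        ];
    }
    return $tags;
}

// Two of the sample tags from the question, embedded in surrounding text.
$text = "intro <{_N( 'some_domain_id' , 'this can ( be an arbitary ) string',{'a':'b','c':'d'})}> middle "
      . "<{_N( 'some_domain_id' , 'this can  be an arbitary string')|e('modifier')}> end";

foreach (extract_tags($text) as $tag) {
    echo $tag['domain'], ' | ', $tag['payload'], ' | ', $tag['rest'], PHP_EOL;
}
```

Note that group 3 holds the raw remainder of the tag, for example `,{'a':'b','c':'d'})` for the first sample tag, so the leading comma and the closing parenthesis still need to be trimmed before the optional part is used.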

Upvotes: 1

Steve_B19

Reputation: 548

Why not just retrieve the parts you need (those inside the single quotes)?

//example 1
$str = '<{_N( \'some_domain_id\' , \'this can ( be an arbitary ) string\',{\'a\':\'b\',\'c\':\'d\'})}>';
test_pregex($str);

//example 2
$str = '<{_N(\'some_domain_id\'      , \'this can ( be an arbitary ) string\'  ,{\'a\':\'b\',\'c\':\'d\'})}  >';
test_pregex($str);

//example 3
$str = '<{_N( \'some_domain_id\' , \'this can ( be an arbitary ) string\')|e(\'modifier\')}>';
test_pregex($str, '\'modifier\'');

function test_pregex($str, $optional = "{'a':'b','c':'d'}") {
    $re = '/\'([^\']*?)\'|(\{\'[^\']*?\'.+?})/m';
    preg_match_all($re, $str, $matches);
    $matches = $matches[0];
    var_export($matches);   
    assert($matches[0] == "'some_domain_id'");
    assert($matches[1] == "'this can ( be an arbitary ) string'");
    assert($matches[2] == $optional);
}

All three cases run with no assertion warnings. You can then process the matches further as required.
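
One possible way to do that further processing is sketched below. It assumes the input is a single, already isolated tag, and the helper name parse_tag() and the keys domain, payload and extra are hypothetical. It reuses the same pattern as above (without the m flag, which has no effect here) and strips the surrounding single quotes from the first two matches.

```php
<?php
// Sketch only: parse_tag() is a hypothetical helper, not part of the answer above.
function parse_tag(string $tag): array
{
    // Same pattern as in test_pregex(): a quoted string or a {...} block.
    $re = '/\'([^\']*?)\'|(\{\'[^\']*?\'.+?})/';
    preg_match_all($re, $tag, $matches);
    $parts = $matches[0];

    return [
        // trim() removes the enclosing single quotes from each value.
        'domain'  => trim($parts[0] ?? '', "'"),
        'payload' => trim($parts[1] ?? '', "'"),
        // Third match is optional; null when the tag has only two parts.
        'extra'   => $parts[2] ?? null,
    ];
}

$tag = "<{_N( 'some_domain_id' , 'this can ( be an arbitary ) string',{'a':'b','c':'d'})}>";
var_export(parse_tag($tag));
```

Because the pattern stops at the first single quote, this sketch would not cope with a payload that itself contains a quote (escaped or not).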

Upvotes: 0
