How to match and parse specific info in tags with varying form

Question

I need to parse and process input data pushed into our webservice as (UTF-8) text files from a 3rd party. The input files contain tags in the general form

<{ _N ('some_domain_id','this can be an arbitary string',{'a':'b','c':'d'}) }>
-- --   --------------   ------------------------------  -----------------  --
        ^                 ^                              ^
        i need to         |                              this part is 
        extract this      and this (payload)             optional

These tags can appear anywhere in the text file; no assumptions can be made about their distribution or about what lies between them. Also, <{, _N and }> are present in every valid tag, but there may be whitespace between them without affecting the values (e.g. between <{ and _N). With that information and an initially limited test set, my current implementation is a regex followed by a split of the result at the ,

Regex: /<{\s*_N\s*\(([^\)]*)\)\s*}>?/g (Example: https://regex101.com/r/NuJD2V/1)
Then split the resulting match 'some_domain_id' , 'this can be an arbitary string',{'a':'b','c':'d'} with str_getcsv($match, ',', '\'', '\\')
Use the first two segments of the str_getcsv result and discard the rest, as they are optional
After that, some_domain_id and this can be an arbitary string can be trimmed and processed as needed
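
For reference, the steps above can be sketched in PHP roughly as follows. This is only an approximation of the approach described: the function name extractTagsNaive is made up for illustration, and the /g flag is dropped because preg_match_all already collects every match.

```php
<?php
// Rough sketch of the regex + str_getcsv approach described above.
// extractTagsNaive is a hypothetical name, not part of the original service.
function extractTagsNaive(string $text): array
{
    $results = [];

    // Capture everything between "_N(" and the first ")" that follows it.
    preg_match_all('/<{\s*_N\s*\(([^\)]*)\)\s*}>?/', $text, $matches);

    foreach ($matches[1] as $args) {
        // Split the argument list on commas, treating ' as the enclosure
        // and \ as the escape character.
        $parts = str_getcsv($args, ',', "'", '\\');
        if (count($parts) < 2) {
            continue; // not enough arguments to be a usable tag
        }
        // Only the first two segments are needed; the rest is optional.
        $results[] = [trim($parts[0]), trim($parts[1])];
    }

    return $results;
}
```

Because the character class [^\)]* stops at the first closing bracket, any ) inside the payload ends the capture too early, which is where this approach breaks down.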

The service has been running for a while now, and I have realized that although the vast majority of tags are caught correctly, a small number of tags contain anomalies and are not recognized by this implementation.

Caveats (things that can happen in the payload part):

brackets in payload
escaped quotes in payload \'
optional modifiers after the outermost brackets of the _N call (see below)

Here are some sample tags I identified that cannot be parsed or that produce wrong results (which is even worse than not being recognized).

<{_N( 'some_domain_id' , 'this can ( be an arbitary ) string',{'a':'b','c':'d'})}>

- Not recognized. Note the brackets: they can occur anywhere in the data string and do not even need to be balanced (Example: https://regex101.com/r/BCiaaj/1)

<{_N( 'some_domain_id' , 'this can  be an arbitary  string' {'a':'b','c':'d'})|e('modifier')}>
<{_N( 'some_domain_id' , 'this can  be an arbitary string')|e('modifier')}>

Not recognized. Note the extra (optional) modifier after the outermost brackets of the _N element. The modifier can consist of different letters (e, r, w) and an arbitrary string argument, and there may be spaces around the chain operator | (Example: https://regex101.com/r/XmR2uO/1)

I have already experimented with a few other regexes, but they always fail on one or more of the tags in my extended test set, e.g.

/_N\s*(\(\s*(?:\(??[^(]*?\s*\)))+/ - catches the modifier case, but fails on brackets in the relevant string

So my questions, as I am not a real regex expert:

Is this solvable with a regex and, if so, can anyone point me in the right direction?
Is there a better solution viable in vanilla PHP 7+ without installing or using an external library?

Any help is highly appreciated!

Wiktor Stribiżew · Accepted Answer

You may use

<{\s*_N\s*\(\s*'([^\\']*(?:\\.[^\\']*)*)'\s*,\s*'([^\\']*(?:\\.[^\\']*)*)'\s*(.*?)}>

See the regex demo

Details:

<{\s* - a <{ plus 0+ whitespaces
_N - tag start
\s*\(\s* - a ( enclosed with 0+ whitespaces
'([^\\']*(?:\\.[^\\']*)*)' - a single quoted string literal that may contain escaped single quotes and other chars (the inside contents are captured into capturing group #1)
\s*,\s* - a , enclosed with 0+ whitespaces
'([^\\']*(?:\\.[^\\']*)*)' - a single quoted string literal that may contain escaped single quotes and other chars (the inside contents are captured into capturing group #2)
\s* - 0+ whitespaces
(.*?) - any 0+ chars other than line break chars, as few as possible, up to the first
}> - literal char sequence }>.
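
As a rough illustration, the pattern could be applied in plain PHP 7+ along these lines. This is a sketch under assumptions rather than a definitive implementation: extractTags and the array keys domain, payload and rest are names invented here, and the unescaping step assumes that \' in the payload stands for a literal single quote.

```php
<?php
// Sketch only: extractTags() and the array keys are illustrative names.
function extractTags(string $text): array
{
    // A nowdoc keeps the backslashes exactly as written, so the pattern
    // reads the same as in the answer. "~" is used as the delimiter.
    $pattern = <<<'REGEX'
~<{\s*_N\s*\(\s*'([^\\']*(?:\\.[^\\']*)*)'\s*,\s*'([^\\']*(?:\\.[^\\']*)*)'\s*(.*?)}>~
REGEX;

    // PREG_SET_ORDER yields one array per tag: [full match, #1, #2, #3].
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $tags = [];
    foreach ($matches as $m) {
        $tags[] = [
            'domain'  => $m[1],
            // Assumption: \' inside the payload stands for a literal quote.
            'payload' => str_replace("\\'", "'", $m[2]),
            // Whatever sits between the second string and "}>": optional
            // third argument, closing bracket, and any |e('...') modifier.
            'rest'    => trim($m[3]),
        ];
    }

    return $tags;
}
```

Brackets inside the payload are no longer a problem because the two quoted strings are matched as string literals, and the optional third argument and any modifiers simply end up in the third group, where they can be ignored or inspected separately.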
