Reputation: 178
I need to parse and process input data pushed into our webservice as (UTF-8) text files from a 3rd party. The input files contain tags in the general form
<{ _N ('some_domain_id','this can be an arbitary string',{'a':'b','c':'d'}) }>
-- --   --------------   ------------------------------  -----------------  --
              ^                        ^                         ^
          I need to                    |                   this part is
        extract this          and this (payload)             optional
These tags can appear anywhere in the text file; no assumptions can be made about their distribution or about what is between the tags. Also, <{, _N and }> are present for any given valid tag, but there might be spaces in between without disrupting the values (e.g. between <{ and _N).
With that info (my initial test set was limited), my current implementation is a regex along with a split of the result at the , character:
1. Extract the argument list with /<{\s*_N\s*\(([^\)]*)\)\s*}>?/g (Example: https://regex101.com/r/NuJD2V/1)
2. Split the match, e.g. 'some_domain_id' , 'this can be an arbitary string',{'a':'b','c':'d'}, with str_getcsv($match,',','\'','\\')
3. Keep the first two values of the str_getcsv result, discard other results as they are optional
4. some_domain_id and this can be an arbitary string can be trimmed and processed as needed
The service has been up for a while now, and I had to realize that although the vast majority of tags are caught correctly, a small number of tags contain anomalies and are not recognized by this implementation.
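The approach described above can be sketched in PHP roughly as follows. This is a minimal sketch, not the original service code: the function name extract_tags_naive is hypothetical, and the /g flag from regex101 is replaced by preg_match_all, which already collects every match.

```php
<?php
// Hypothetical helper illustrating the regex + str_getcsv approach from the question.
function extract_tags_naive(string $text): array
{
    $results = [];
    // Same pattern as in the question; [^\)]* stops at the first closing bracket.
    preg_match_all('/<{\s*_N\s*\(([^\)]*)\)\s*}>?/', $text, $matches);
    foreach ($matches[1] as $inner) {
        // Split at commas, treating ' as the enclosure and \ as the escape character.
        $fields = str_getcsv($inner, ',', "'", '\\');
        // Only the first two fields are needed; anything after them is optional.
        $results[] = [trim($fields[0] ?? ''), trim($fields[1] ?? '')];
    }
    return $results;
}
```

Because [^\)]* cannot cross a closing bracket, a payload such as 'this can ( be an arbitary ) string' ends the capture too early and the tag is not matched at all.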
Caveats (things that can happen in the payload part):
- escaped single quotes (\') inside the strings
Here are some sample tags I identified that cannot be parsed or that produce wrong results (which is even worse than not being recognized).
<{_N( 'some_domain_id' , 'this can ( be an arbitary ) string',{'a':'b','c':'d'})}>
- Not recognized, note the brackets, they can occur anywhere in the data string, they don't even need to be balanced (Example: https://regex101.com/r/BCiaaj/1)
<{_N( 'some_domain_id' , 'this can be an arbitary string' {'a':'b','c':'d'})|e('modifier')}>
<{_N( 'some_domain_id' , 'this can be an arbitary string')|e('modifier')}>
- Note the modifier appended after the | (Example: https://regex101.com/r/XmR2uO/1)
Experimentally, I tried a few other regexes already, but they always fail on one or more of the tags in my extended test set, e.g.
/_N\s*(\(\s*(?:\(??[^(]*?\s*\)))+/
- catches the modifier case, but fails on brackets in the relevant string
So my questions, as I am not a real regex expert:
Any help is highly appreciated!
Upvotes: 1
Views: 77
Reputation: 626896
You may use
<{\s*_N\s*\(\s*'([^\\']*(?:\\.[^\\']*)*)'\s*,\s*'([^\\']*(?:\\.[^\\']*)*)'\s*(.*?)}>
See the regex demo
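As a rough sketch of how this pattern could be applied in PHP: the helper name extract_tags and the array keys are assumptions made for illustration, and a nowdoc with ~ as the delimiter is used only so that the backslashes reach the regex engine exactly as written above.

```php
<?php
// Hypothetical helper; the pattern itself is the one given in this answer.
function extract_tags(string $text): array
{
    // Nowdoc: no PHP-level escaping, so \\ stays a regex-level escaped backslash.
    $pattern = <<<'RE'
~<{\s*_N\s*\(\s*'([^\\']*(?:\\.[^\\']*)*)'\s*,\s*'([^\\']*(?:\\.[^\\']*)*)'\s*(.*?)}>~
RE;
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    $tags = [];
    foreach ($matches as $m) {
        // Group 1: domain id, group 2: payload string, group 3: everything up to }>
        $tags[] = ['domain' => $m[1], 'payload' => $m[2], 'rest' => $m[3]];
    }
    return $tags;
}
```

The captured payload keeps its backslash escapes (for example \' stays as written), so unescaping is left to the caller. Without the s modifier, . does not match line breaks, so a tag whose trailing part spans several lines would not match.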
Details:
<{\s* - a <{ plus 0+ whitespaces
_N - tag start
\s*\(\s* - a ( enclosed with 0+ whitespaces
'([^\\']*(?:\\.[^\\']*)*)' - a single quoted string literal that may contain escaped single quotes and other chars (the inside contents are captured into capturing group #1)
\s*,\s* - a , enclosed with 0+ whitespaces
'([^\\']*(?:\\.[^\\']*)*)' - a single quoted string literal that may contain escaped single quotes and other chars (the inside contents are captured into capturing group #2)
\s* - 0+ whitespaces
(.*?) - any 0+ chars, as few as possible, up to the first
}> - literal char sequence }>.
Upvotes: 1
Reputation: 548
Why not just retrieve the parts you need (those in single quotes)?
// test_pregex is declared at the top level further down; PHP hoists it, so these calls work.
// example 1: brackets inside the payload string
$str = '<{_N( \'some_domain_id\' , \'this can ( be an arbitary ) string\',{\'a\':\'b\',\'c\':\'d\'})}>';
test_pregex($str);
// example 2: extra whitespace, including between } and >
$str = '<{_N(\'some_domain_id\' , \'this can ( be an arbitary ) string\' ,{\'a\':\'b\',\'c\':\'d\'})} >';
test_pregex($str);
// example 3: no optional part, but a modifier after the closing bracket
$str = '<{_N( \'some_domain_id\' , \'this can ( be an arbitary ) string\')|e(\'modifier\')}>';
test_pregex($str, '\'modifier\'');

function test_pregex($str, $optional = "{'a':'b','c':'d'}") {
    // Match either a single-quoted string or a {...} block that starts with a quoted key
    $re = '/\'([^\']*?)\'|(\{\'[^\']*?\'.+?})/m';
    preg_match_all($re, $str, $matches);
    $matches = $matches[0]; // keep the full matches only
    var_export($matches);
    assert($matches[0] == "'some_domain_id'");
    assert($matches[1] == "'this can ( be an arbitary ) string'");
    assert($matches[2] == $optional);
}
The output shows the matches for all three cases, with no assertion warnings. You can then process what you require further.
Upvotes: 0