Literal whitespace characters causing pattern to fail (sometimes)

Question

I have this regex from my previous question. The problem is that sometimes it works and sometimes it doesn't. I tried pasting it into an online tester and got this: https://regex101.com/r/I3tnY4/3

The text comes from a file that I read using file_get_contents.

The contents of the file are complete but when I run it through the RegEx to filter it:

        $data = file_get_contents($var);
        $pat  = '/intervals \[\d+\]:\s+\Kxmin = (?P<xmin>\d+(\.\d+)?) \
                \s+xmax = (?P<xmax>\d+(\.\d+)?)\s+text = "(?P<text>[^"]*)"/m';

        // print_r($data);
        preg_match_all($pat, $data, $m);
        $result = array_map(function($a){
            return array_combine(['xmin', 'xmax', 'text'], $a);
        }, array_map(null, $m['xmin'], $m['xmax'], $m['text']));

        print_r($result);

it returns an empty array. At first it worked, but it stopped working when I added a for loop to handle multiple file uploads.

This also happened before when I tried to process the file right after it was uploaded.

Like this:

    if (move_uploaded_file($_FILES["uploadedfile"]["tmp_name"], $target_file)) {
        if (file_exists($target_file)) {
            $data = file_get_contents($target_file);
            $pat  = '/intervals \[\d+\]:\s+\Kxmin = (?P<xmin>\d+(\.\d+)?) \
            \s+xmax = (?P<xmax>\d+(\.\d+)?)\s+text = "(?P<text>[^"]*)"/m';

            preg_match_all($pat, $data, $m);
            $result = array_map(function($a){
                return array_combine(['xmin', 'xmax', 'text'], $a);
            }, array_map(null, $m['xmin'], $m['xmax'], $m['text']));

            print_r($result);
        }
    }

With the above code, the regex also failed: the $result array was empty. I assumed that the file was not yet ready to be read, even though everything was there when I printed its contents. So I redirected my page to another file that did the regex processing and, surprisingly, it worked there.

mickmackusa · Accepted Answer

It appears that your task is focused on substring extraction rather than validation. For this reason, you can greatly reduce the size of your pattern, speed up execution, and minimize output bloat with the following pattern:

/xmin = (\S+)\s+xmax = (\S+)\s+text = "([^"]*)/

What have I done? (See this demo for official pattern breakdown)

Remove the leading interval... matching since you are not using it (or, more specifically, the number inside of the []).
Remove \K because you don't need to "restart" the fullstring match -- you aren't using it.
Remove the named capture groups because you are using array_map() and array_combine() to assign these key names anyhow. Named capture groups cause major output array bloat, and should be avoided unless you have a compelling reason to use them. They cause bloat because preg_match_all() writes every named capture twice (once under its name and once under its numeric index), which doubles the captured data. Yes, you can use named capture groups, but then your mapping process has to remove all of the indexed elements from each subarray ([0],[1],[2],[3]).
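Remove the break in your pattern. Without the x flag, the backslash, the newline, and the indentation that follows are literal characters that the input must reproduce exactly, which is why the pattern only matches sometimes. When you want to accommodate one or more whitespace characters (in your case: newlines, spaces, and possibly tabs) just use \s+. For the record, you can use whitespace in your pattern to improve readability, but to do this you need to include x as a flag at the end of your pattern. The x pattern modifier ignores ALL unescaped whitespace in the pattern outside of character classes, so beware of this effect: a literal space then has to be written as \s, [ ], or an escaped space.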
Replace (?P<xmin>\d+(\.\d+)?) with (\S+). This will remove the named capture group and the nested capture group, and extract the entire non-whitespace substring. If you DO want to validate this string, then I advise: (\d+(?:\.\d+)?) This changes the nested group to "non-capturing" -- again reducing output array bloat.
You were wise to use a negated character class in the last capture group; this is the most efficient way to match it. You don't need the trailing ", so that can be removed.
Remove the m pattern modifier. You aren't using any anchor metacharacters (^ or $) so the flag has no purpose.
Passing PREG_SET_ORDER as preg_match_all()'s 4th parameter will structure your subarrays in such a way that only one array_map() is necessary to set up your multi-dimensional array.
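
To make the point about the line break concrete, here is a minimal sketch. It assumes a shortened, one-interval version of the sample data, and the variable names ($broken, $extended, and so on) are only illustrative. In a single-quoted PHP string the trailing backslash and the newline both stay in the pattern, so without the x flag the regex engine expects a literal newline followed by exactly the indentation typed in the source file. With the x flag the same layout is ignored, and the spaces that do matter have to be spelled out (here with \s).

```php
<?php
// Assumed sample input: one interval in the shape shown in the question.
$data = "intervals [1]:\n    xmin = 0 \n    xmax = 13.1 \n    text = \"abc\" \n";

// Pattern split across two lines WITHOUT the x flag: the space, the escaped
// newline, and the indentation spaces on the next line are matched literally.
$broken = '/xmin = (\S+) \
        \s+xmax = (\S+)\s+text = "([^"]*)/';

// Same layout WITH the x flag: unescaped whitespace in the pattern is ignored,
// so every space that must be matched is written as \s.
$extended = '/xmin \s = \s (\S+)
        \s+ xmax \s = \s (\S+)
        \s+ text \s = \s "([^"]*)/x';

$brokenCount   = preg_match_all($broken, $data, $m1, PREG_SET_ORDER);
$extendedCount = preg_match_all($extended, $data, $m2, PREG_SET_ORDER);

var_dump($brokenCount);    // int(0): the input has 4 indentation spaces, the pattern demands more
var_dump($extendedCount);  // int(1)
var_dump($m2[0][3]);       // string(3) "abc"
```

Whether the broken version matches depends entirely on how the pattern happens to be indented in the source file, which would be consistent with it working in one place and failing after the code was moved into a loop.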

This is how I suggest that you implement it:

Code: (Demo)

$data='intervals [1]:
    xmin = 0 
    xmax = 13.139997023062838 
    text = "" 
intervals [2]:
    xmin = 13.139997023062838 
    xmax = 14.763036269953904 
    text = "Cities are like siblings in a large polygamous family." 
intervals [3]:
    xmin = 14.763036269953904 
    xmax = 17.01 
    text = ""';
$pat = '/xmin = (\S+)\s+xmax = (\S+)\s+text = "([^"]*)/';
if (preg_match_all($pat, $data, $m, PREG_SET_ORDER)) {
    // Each $a is [fullmatch, xmin, xmax, text]; drop the full match, then assign the keys.
    $assoc_multidim = array_map(function ($a) {
        return array_combine(['xmin', 'xmax', 'text'], array_slice($a, 1));
    }, $m);
    var_export($assoc_multidim);
} else {
    echo "substring extraction failed";
}

Output:

array (
  0 => 
  array (
    'xmin' => '0',
    'xmax' => '13.139997023062838',
    'text' => '',
  ),
  1 => 
  array (
    'xmin' => '13.139997023062838',
    'xmax' => '14.763036269953904',
    'text' => 'Cities are like siblings in a large polygamous family.',
  ),
  2 => 
  array (
    'xmin' => '14.763036269953904',
    'xmax' => '17.01',
    'text' => '',
  ),
)

An alternative method that makes use of your named capture groups would look like this: (Demo)

$pat = '/xmin = (?P<xmin>\S+)\s+xmax = (?P<xmax>\S+)\s+text = "(?P<text>[^"]*)/';
if (preg_match_all($pat, $data, $m, PREG_SET_ORDER)) {
    // Keep only the named elements; the full match and the indexed duplicates are discarded.
    $assoc_multidim = array_map(function ($a) {
        return array_intersect_key($a, ['xmin' => '', 'xmax' => '', 'text' => '']);
    }, $m);
    var_export($assoc_multidim);
} else {
    echo "substring extraction failed";
}

As you can see, both techniques require a little bit of cleanup (unless your subsequent processes don't mind the indexed elements), which is why I favor the less bloated array.
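
To illustrate the bloat, the following sketch compares the raw match arrays produced by the indexed and the named versions of the pattern. The one-interval input and the variable names ($indexed, $named, $a, $b) are assumptions made for the example only.

```php
<?php
// Assumed sample input: a single interval body.
$data = "xmin = 0 \n    xmax = 13.1 \n    text = \"abc\"";

$indexed = '/xmin = (\S+)\s+xmax = (\S+)\s+text = "([^"]*)/';
$named   = '/xmin = (?P<xmin>\S+)\s+xmax = (?P<xmax>\S+)\s+text = "(?P<text>[^"]*)/';

preg_match_all($indexed, $data, $a, PREG_SET_ORDER);
preg_match_all($named, $data, $b, PREG_SET_ORDER);

// Indexed groups: the full match plus three captures.
var_export(array_keys($a[0]));  // keys 0, 1, 2, 3
echo "\n";

// Named groups: every capture is stored twice, by name and by index.
var_export(array_keys($b[0]));  // keys 0, 'xmin', 1, 'xmax', 2, 'text', 3
echo "\n";
```

For one match the indexed pattern yields 4 elements and the named pattern yields 7, so the extra cost grows with the number of matches and capture groups.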
