Reputation: 761

Regex to parse a line and capture a string and a comma-separated number

I am trying to parse a file with lines similar to:

       John David James (DEM) .  .  .  .  .  .     7,808   10.51
       Marvin D. Scott (DEM)  .  .  .  .  .  .     6,548    9.55
       Maria "Mary" Williams (DEM)  .  .  .  .     4,551    8.58
       Dwayne R. Johnson.  .  .  .  .  .  .  .     4,322    8.22
       WRITE-IN.  .  .  .  .  .  .  .  .  .  .       188     .29

I need to capture the name and the number in the first numeric column. The end result would be:

John David James (DEM),7808
Marvin D. Scott (DEM),6548
Maria "Mary" Williams (DEM),4551
Dwayne R. Johnson,4322
WRITE-IN,188

I've tried

\s*\b(.*)\b(\s*\.\s*.*)(\d+,\d+|\d+)\b
\s*\b(.*)\b(\.|.\s)+\b(\d+,\d+|\d+)\b

Any suggestions?

Upvotes: 1

Answers (3)

Salman Arshad

Reputation: 272106

If the data is column aligned (all columns have known, fixed width) then use string functions such as substr:

<?php
$lines = '
       John David James (DEM) .  .  .  .  .  .     7,808   10.51
       Marvin D. Scott (DEM)  .  .  .  .  .  .     6,548    9.55
       Maria "Mary" Williams (DEM)  .  .  .  .     4,551    8.58
       Dwayne R. Johnson.  .  .  .  .  .  .  .     4,322    8.22
       WRITE-IN.  .  .  .  .  .  .  .  .  .  .       188     .29
';

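// Split on any run of line breaks and skip the empty pieces.
foreach (preg_split('/[\r\n]+/', $lines) as $line) {
    if ($line === '') continue;
    // The name and its dot leaders occupy the first 46 characters.
    $name = substr($line, 0, 46);
    // The vote count occupies the next 10 characters.
    $amount = substr($line, 46, 10);
    $name = rtrim(ltrim($name), " .");
    // Drop the thousands separator; the int cast ignores the leading spaces.
    $amount = (int) str_replace(",", "", $amount);
    // No space after the comma, and one result per line, as requested.
    echo $name . "," . $amount . "\n";
}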

Upvotes: 1

Amessihel

Reputation: 6384

You can do this with an ungreedy (lazy) regex.

To capture the name, we want "a sequence of any characters followed by a sequence of dots and spaces". The equivalent pattern is: (.+)[. ]*

But the engine is greedy by default. What happens? The first part, (.+), won't stop at the first dot or space it meets. Why? Because the rest of the pattern can still match when (.+) runs further toward the end of the line, and a greedy quantifier takes the longest option that still lets the overall match succeed.

The same goes for the full pattern in the working code below: in greedy mode, the first capturing group would capture beyond the name field.

So we need to tell the engine to consume as little as possible.

<?php
$lines = '
       John David James (DEM) .  .  .  .  .  .     7,808   10.51
       Marvin D. Scott (DEM)  .  .  .  .  .  .     6,548    9.55
       Maria "Mary" Williams (DEM)  .  .  .  .     4,551    8.58
       Dwayne R. Johnson.  .  .  .  .  .  .  .     4,322    8.22
       WRITE-IN.  .  .  .  .  .  .  .  .  .  .       188     .29
';
$lines = explode("\n", $lines);

// Here, the U flag sets the ungreedy mode
$pattern = '/^\s*(\S.+\S)[. ]+([0-9]+)(?:,([0-9]+))?\s.*$/U';
echo "<pre>";
foreach ($lines as $line) {
    // In the replacement: - ${1} is the name,
    //                     - ${2} the digits before the thousands separator,
    //                     - ${3} the digits after it (empty if there is no comma)
    echo preg_replace($pattern, '${1},${2}${3}', $line) . "\n";
}
echo "</pre>";
?>

Result:

John David James (DEM),7808
Marvin D. Scott (DEM),6548
Maria "Mary" Williams (DEM),4551
Dwayne R. Johnson,4322
WRITE-IN,188
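
To see the greedy/lazy difference in isolation, here is a minimal sketch on a single sample line. The helper name captureName is made up for this illustration, and the pattern is a simplified one rather than the full pattern above. Instead of the U flag it uses .+? which makes just that one quantifier lazy.

```php
<?php
// Hypothetical helper for illustration: returns what group 1 captures,
// or null when the pattern does not match.
function captureName(string $pattern, string $line): ?string
{
    return preg_match($pattern, $line, $m) === 1 ? $m[1] : null;
}

$line = '       Dwayne R. Johnson.  .  .  .  .  .  .  .     4,322    8.22';

// Greedy: (.+) takes as much as it can while "[. ]+\d" can still match,
// so the capture runs past the dot leaders and into the numbers.
$greedy = captureName('/^\s*(.+)[. ]+\d/', $line);

// Lazy: (.+?) stops at the earliest point where "[. ]+\d" can match,
// which is right before the dot leaders.
$lazy = captureName('/^\s*(.+?)[. ]+\d/', $line);

var_dump($greedy); // includes the dot leaders and "4,322"
var_dump($lazy);   // string(17) "Dwayne R. Johnson"
```

Making a single quantifier lazy with ? keeps the rest of the pattern greedy, which can be easier to reason about than flipping every quantifier with the U flag.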

Upvotes: 1

Andreas

Reputation: 23958

This pattern captures the name by locating the run of dots that follows it, then captures the first run of digits and commas as the number.

The loop then builds the new array, removing the commas from each number.

$str = '       John David James (DEM) .  .  .  .  .  .     7,808   10.51
       Marvin D. Scott (DEM)  .  .  .  .  .  .     6,548    9.55
       Maria "Mary" Williams (DEM)  .  .  .  .     4,551    8.58
       Dwayne R. Johnson.  .  .  .  .  .  .  .     4,322    8.22
       WRITE-IN.  .  .  .  .  .  .  .  .  .  .       188     .29';
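// Group 1 is the name, group 2 the first run of digits and commas after the dots.
preg_match_all('/\s*(.*?)\s*\.  \..*?([\d,]+)/', $str, $matches);

$new = [];
foreach ($matches[1] as $key => $name) {
    // Strip the thousands separator before joining name and number.
    $new[] = $name . "," . str_replace(",", "", $matches[2][$key]);
}

var_dump($new);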

Output:

array(5) {
  [0]=>
  string(27) "John David James (DEM),7808"
  [1]=>
  string(26) "Marvin D. Scott (DEM),6548"
  [2]=>
  string(32) "Maria "Mary" Williams (DEM),4551"
  [3]=>
  string(22) "Dwayne R. Johnson,4322"
  [4]=>
  string(12) "WRITE-IN,188"
}

https://3v4l.org/SdqoZ
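
A possible variation on the loop above, as a sketch rather than a drop-in replacement: the same pattern wrapped in a function that returns the rows. The function name extractResults is hypothetical, and the ^ anchor with the m flag is an addition here so that each match has to start at the beginning of a line. PREG_SET_ORDER makes preg_match_all return one array per match, which avoids indexing two parallel arrays.

```php
<?php
// Hypothetical wrapper for illustration; not part of the original answer.
function extractResults(string $text): array
{
    $rows = [];
    // ^ with the m flag anchors each match at a line start (an added assumption).
    // PREG_SET_ORDER groups the captures per match instead of per group.
    preg_match_all('/^\s*(.*?)\s*\.  \..*?([\d,]+)/m', $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // $m[1] is the name, $m[2] the number with its thousands separator.
        $rows[] = $m[1] . ',' . str_replace(',', '', $m[2]);
    }
    return $rows;
}

$sample = "       John David James (DEM) .  .  .  .  .  .     7,808   10.51\n"
        . "       WRITE-IN.  .  .  .  .  .  .  .  .  .  .       188     .29";

echo implode("\n", extractResults($sample)), "\n";
```

Like the original pattern, this relies on the leader containing at least one "dot, two spaces, dot" sequence, so a line with a single dot after the name would be skipped.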

Upvotes: 1
