Reputation: 761
I am trying to parse a file with lines similar to:
John David James (DEM) . . . . . . 7,808 10.51
Marvin D. Scott (DEM) . . . . . . 6,548 9.55
Maria "Mary" Williams (DEM) . . . . 4,551 8.58
Dwayne R. Johnson. . . . . . . . 4,322 8.22
WRITE-IN. . . . . . . . . . . 188 .29
I need to capture the name and the number in the first column. The end result would be
John David James (DEM),7808
Marvin D. Scott (DEM),6548
Maria "Mary" Williams (DEM),4551
Dwayne R. Johnson,4322
WRITE-IN,188
I've tried
\s*\b(.*)\b(\s*\.\s*.*)(\d+,\d+|\d+)\b
\s*\b(.*)\b(\.|.\s)+\b(\d+,\d+|\d+)\b
Any suggestions?
Upvotes: 1
Views: 92
Reputation: 272106
If the data is column-aligned (every column has a known, fixed width), use string functions such as substr:
<?php
$lines = '
John David James (DEM) . . . . . . 7,808 10.51
Marvin D. Scott (DEM) . . . . . . 6,548 9.55
Maria "Mary" Williams (DEM) . . . . 4,551 8.58
Dwayne R. Johnson. . . . . . . . 4,322 8.22
WRITE-IN. . . . . . . . . . . 188 .29
';
foreach(preg_split('/(\\r|\\n)+/', $lines) as $line) {
if ($line === '') continue;
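// Offsets assume the fixed-width layout of the original file:
// the name occupies the first 46 characters, the count the next 10.
$name = substr($line, 0, 46);
$amount = substr($line, 46, 10);
// Strip leading whitespace, then trailing spaces and dot leaders.
$name = rtrim(ltrim($name), " .");
// Drop the thousands separator before converting to a number.
$amount = (float) str_replace(",", "", $amount);
echo $name . "," . $amount . "\n";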
}
Upvotes: 1
Reputation: 6384
You can achieve it with an UNGREEDY regexp.
Here, when we capture the name, we want "a sequence of any characters followed by a sequence of dots and spaces". The equivalent regexp is (.+)[. ]*.
But the engine is greedy by default. What happens? The first part, (.+), won't stop at the first dot or space it encounters. Why? Because (.+) can consume the rest of the line and the whole expression still matches, and in greedy mode the engine prefers that path.
The same goes for the full regexp in the working code below: in greedy mode, the first capturing group would capture beyond the name field.
We need to tell the engine to match as little as possible.
<?php
$lines = '
John David James (DEM) . . . . . . 7,808 10.51
Marvin D. Scott (DEM) . . . . . . 6,548 9.55
Maria "Mary" Williams (DEM) . . . . 4,551 8.58
Dwayne R. Johnson. . . . . . . . 4,322 8.22
WRITE-IN. . . . . . . . . . . 188 .29
';
$lines = explode("\n", $lines);
// Here, the U flag sets the ungreedy mode
$pattern = '/^\s*(\S.+\S)[. ]+([0-9]+)(?:,([0-9]+))?\s.*$/U';
echo "<pre>";
foreach ($lines as $line) {
// Here: - ${1} captures the name,
// - ${2} the digits before the thousands separator,
// - ${3} the digits after it (empty when there is no comma)
echo preg_replace($pattern, '${1},${2}${3}', $line) . "\n";
}
echo "</pre>";
?>
Result:
John David James (DEM),7808
Marvin D. Scott (DEM),6548
Maria "Mary" Williams (DEM),4551
Dwayne R. Johnson,4322
WRITE-IN,188
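To see the greedy/lazy difference in isolation, here is a minimal sketch on a single line from the question. It uses the inline lazy quantifier `+?` instead of the `U` flag, and the lookahead `(?=\d)` is my own addition (not part of the answer's pattern) to force the dot leaders to run up to the first digit.

```php
<?php
// One sample line from the question.
$line = 'Dwayne R. Johnson. . . . . . . . 4,322 8.22';

// Greedy: (.+) takes as much as it can, and [. ]* is happy to match nothing.
preg_match('/^(.+)[. ]*/', $line, $greedy);

// Lazy: (.+?) takes as little as it can, so the context that follows
// (dot leaders, then a digit) decides where the name ends.
preg_match('/^(.+?)[. ]+(?=\d)/', $line, $lazy);

echo $greedy[1], "\n"; // Dwayne R. Johnson. . . . . . . . 4,322 8.22
echo $lazy[1], "\n";   // Dwayne R. Johnson
```

The `U` flag flips the default for every quantifier in the pattern, whereas `+?` makes only that one quantifier lazy; which reads better is a matter of taste.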
Upvotes: 1
Reputation: 23958
This pattern captures the name by finding the dot sequence that follows it.
It then captures a run of digits and commas as the number.
Finally, I loop over the matches to build the new array, stripping the commas.
$str = ' John David James (DEM) . . . . . . 7,808 10.51
Marvin D. Scott (DEM) . . . . . . 6,548 9.55
Maria "Mary" Williams (DEM) . . . . 4,551 8.58
Dwayne R. Johnson. . . . . . . . 4,322 8.22
WRITE-IN. . . . . . . . . . . 188 .29';
preg_match_all("/\s*(.*?)\s*\. \..*?([\d,]+)/", $str, $matches);
$new = [];
foreach($matches[1] as $key => $name){
$new[] = $name . "," . str_replace(",", "", $matches[2][$key]);
}
var_dump($new);
Output:
array(5) {
[0]=>
string(27) "John David James (DEM),7808"
[1]=>
string(26) "Marvin D. Scott (DEM),6548"
[2]=>
string(32) "Maria "Mary" Williams (DEM),4551"
[3]=>
string(22) "Dwayne R. Johnson,4322"
[4]=>
string(12) "WRITE-IN,188"
}
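A possible hardening of the same idea, as a sketch rather than a drop-in replacement: anchoring the pattern with `^` and the `m` flag makes every match start at the beginning of a line, so a match can never begin in the leftover percentage column. The `\h*` (horizontal whitespace) and `PREG_SET_ORDER` are choices of mine, not part of the answer above.

```php
<?php
$str = <<<'TXT'
John David James (DEM) . . . . . . 7,808 10.51
Marvin D. Scott (DEM) . . . . . . 6,548 9.55
Maria "Mary" Williams (DEM) . . . . 4,551 8.58
Dwayne R. Johnson. . . . . . . . 4,322 8.22
WRITE-IN. . . . . . . . . . . 188 .29
TXT;

// ^ with the m flag anchors each match at a line start;
// \h* skips any leading indentation on that line.
$pattern = '/^\h*(.*?)\s*\. \..*?([\d,]+)/m';
preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    // $m[1] is the name, $m[2] the count with its thousands separator.
    $rows[] = $m[1] . ',' . str_replace(',', '', $m[2]);
}

echo implode("\n", $rows), "\n";
```

With `PREG_SET_ORDER`, each element of `$matches` holds one line's groups together, which avoids indexing two parallel arrays by key.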
Upvotes: 1