Reputation: 93
I need to collect all the images referenced in an HTML document that I am parsing in PHP, based on an expression formatted like this:
(fig. 8a-c, 9b-c)
I would like to catch this using a regex in order to output an array such as:
array(
[8] => [a,b,c],
[9] => [b,c])
The expression can be anything like:
(fig. 8)
(fig. 8,9)
(fig. 11a, b)
Here is the regex I have at the moment, but it does not work for every case:
https://regex101.com/r/ShqlnY/3/
Can you help me get an array containing all the included images? Thanks
Upvotes: 2
Views: 58
Reputation: 93
Thanks, I ended up with a regular expression like this:
'/(?:\(fig\.\h*|\G(?!^))(\d+)([a-z])?(?:-([a-z])?)?(?:,\h*)?(?=[^)]*\))/m'
used with preg_match_all
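As a minimal sketch of how the matches from this pattern could be turned into the array described in the question: the helper name parseFigureRefs is hypothetical, and the sketch assumes lowercase letters only. A bare letter after a comma, as in (fig. 11a, b), is not picked up by this pattern, so only the a would be reported there.

```php
<?php
// Hypothetical helper: maps each figure number to its list of letters.
function parseFigureRefs(string $text): array
{
    $pattern = '/(?:\(fig\.\h*|\G(?!^))(\d+)([a-z])?(?:-([a-z])?)?(?:,\h*)?(?=[^)]*\))/m';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $figures = [];
    foreach ($matches as $m) {
        $number = (int) $m[1];
        // Groups 2 and 3 are optional; missing trailing groups are absent.
        $from = $m[2] ?? '';
        $to   = $m[3] ?? '';

        if ($from === '') {
            $letters = [];                 // e.g. "8"
        } elseif ($to === '') {
            $letters = [$from];            // e.g. "11a"
        } else {
            $letters = range($from, $to);  // e.g. "8a-c"
        }
        $figures[$number] = array_merge($figures[$number] ?? [], $letters);
    }
    return $figures;
}

print_r(parseFigureRefs('(fig. 8a-c, 9b-c)'));
```

A figure with no letters, such as (fig. 8), ends up with an empty list under its number.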
Upvotes: 1
Reputation: 163287
Perhaps for your example data you might use a range and a pattern with 3 capturing groups where the third group is optional.
If the third group does not exist, you return the single value in an array; otherwise you use the second and third groups to create a range.
(?:^\(fig\.\h*|\G(?!^))(\d+)([a-z])(?:-([a-z])?)?(?:,\h*)?(?=[^)]*\))
(?: - Non capturing group
^\(fig\.\h* - Match the start of the string (or of a line, with the m flag) and (fig. followed by 0+ horizontal whitespace chars
| - Or
\G(?!^) - Assert position at the end of the previous match, not at the start
) - Close non capturing group
(\d+)([a-z]) - Capture 1+ digits in group 1, capture a-z in group 2
(?: - Non capturing group
-([a-z])? - Match - and optionally capture a-z in group 3
)? - Close non capturing group and make it optional
(?:,\h*)? - Match an optional , and 0+ horizontal whitespace chars
(?=[^)]*\)) - Assert that there is a closing parenthesis on the right
For example:
$pattern = "/(?:^\(fig\.\h*|\G(?!^))(\d+)([a-z])(?:-([a-z])?)?(?:,\h*)?(?=[^)]*\))/m";
$str = '(fig. 8a-c, 9b-c)
(fig. 8)
(fig. 8,9)
(fig. 11a, b)';

// With PREG_OFFSET_CAPTURE every group is a [text, offset] pair.
preg_match_all($pattern, $str, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE, 0);

$matches = array_map(function($x){
    // Group 3 is present: expand the letter range, e.g. a-c => [a, b, c].
    if (isset($x[3][0])) {
        return [
            $x[1][0] => range($x[2][0], $x[3][0]),
            "start" => $x[1][1],
            "end" => $x[3][1]
        ];
    }
    // No range: return the single letter from group 2.
    return [
        $x[1][0] => [$x[2][0]],
        "start" => $x[1][1],
        "end" => $x[2][1]
    ];
}, $matches);

print_r($matches);
Result (the entries for the first input line, (fig. 8a-c, 9b-c)):
Array
(
    [0] => Array
        (
            [8] => Array
                (
                    [0] => a
                    [1] => b
                    [2] => c
                )
            [start] => 6
            [end] => 9
        )
    [1] => Array
        (
            [9] => Array
                (
                    [0] => b
                    [1] => c
                )
            [start] => 12
            [end] => 15
        )
)
See a PHP demo.
Upvotes: 0
Reputation: 626794
You may use
'~(?:\G(?!^),\s*|\(fig\.)\s*\K([0-9]{1,3})([a-z]-[a-z])~'
with preg_match_all to get all the char ranges from inside a (fig. ...) substring (see the regex demo), and then use this post-processing code:
$rx = "~(?:\G(?!^),\s*|\(fig\.)\s*\K([0-9]{1,3})([a-z]-[a-z])~";
$s = "(fig. 8a-c, 9b-c)";
preg_match_all($rx, $s, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER, 0);

foreach ($matches as $m) {
    $result = [];
    $result[] = $m[0][1]; // Position of the match
    $result[] = $m[1][0]; // The number
    $kv = explode("-", $m[2][0]); // Split "a-c" into its two end points
    $result = array_merge($result, buildNumChain($kv));
    print_r($result);
}

// Expand a [first, last] pair of letters into the full list of letters.
function buildNumChain($arr) {
    $ret = [];
    foreach(range($arr[0], $arr[1]) as $letter) {
        $ret[] = $letter;
    }
    return $ret;
}
Output:
Array ( [0] => 6 [1] => 8 [2] => a [3] => b [4] => c )
Array ( [0] => 12 [1] => 9 [2] => b [3] => c )
See the PHP demo.
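The pattern above only matches a number followed by a letter range such as 8a-c. As a rough sketch of how the other forms from the question, (fig. 8), (fig. 8,9) and (fig. 11a, b), might be covered, one option is to swap the single \G-based regex for a two-step parse: first capture the text inside (fig. ...), then split it on commas and classify each token. The function name expandFigureRefs is hypothetical, and the sketch assumes that a bare letter belongs to the most recent figure number.

```php
<?php
// Hypothetical helper, not part of the answer above.
function expandFigureRefs(string $text): array
{
    $figures = [];
    // Step 1: grab the text between "(fig." and the closing parenthesis.
    preg_match_all('~\(fig\.\s*([^)]*)\)~', $text, $groups);

    foreach ($groups[1] as $body) {
        $current = null; // last figure number seen, reused by bare letters
        // Step 2: split on commas and classify each token.
        foreach (preg_split('~\s*,\s*~', trim($body)) as $token) {
            if (!preg_match('~^(\d+)?([a-z])?(?:-([a-z]))?$~', $token, $m)) {
                continue; // unrecognised token, skip it
            }
            if (($m[1] ?? '') !== '') {
                $current = (int) $m[1];
                $figures[$current] = $figures[$current] ?? [];
            }
            if ($current === null || ($m[2] ?? '') === '') {
                continue; // no figure yet, or a number without letters
            }
            // A second letter means a range such as a-c.
            $letters = isset($m[3]) ? range($m[2], $m[3]) : [$m[2]];
            $figures[$current] = array_merge($figures[$current], $letters);
        }
    }
    return $figures;
}

print_r(expandFigureRefs('(fig. 11a, b)'));
```

Splitting on commas keeps each regex small, at the cost of losing the match offsets that the PREG_OFFSET_CAPTURE version reports.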
Regex details
(?:\G(?!^),\s*|\(fig\.) - (fig. or end of the previous match + , and 0+ whitespaces
\s* - 0+ whitespaces
\K - match reset operator
([0-9]{1,3}) - Group 1: 1 to 3 digits
([a-z]-[a-z]) - Group 2: a lowercase letter, - and a lowercase letter.
Upvotes: 0