Reputation: 434
This is driving me insane...
I have the following code:
# open pdf
$pdf = file_get_contents('myfile.pdf');
echo("RE 1:\n");
preg_match('/^[0-9]+ 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \\]/msU', $pdf, $m);
var_dump($m);
echo("\nRE 2:\n");
preg_match('/^8 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \\]/msU', $pdf, $m);
var_dump($m);
The file myfile.pdf contains the following text:
...
8 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 6 0 R
/Contents [ 5 0 R ]
>>
endobj
...
The only difference between the two regular expressions is the start of the pattern: RE 1 uses the numeric range [0-9]+, while RE 2 hard-codes 8. Yet I get the following output:
RE 1:
array(0) {
}
RE 2:
array(2) {
[0]=>
string(78) "8 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 6 0 R
/Contents [ 5 0 R ]"
[1]=>
string(3) "5 0"
}
I would expect both regular expressions to return similar results, but the regular expression with the numeric range at the start (RE 1) doesn't return any results. Is this a bug or am I doing something wrong?
After adding preg_last_error(), I am getting PREG_BACKTRACK_LIMIT_ERROR. How can I fix that?
Upvotes: 1
Views: 471
Reputation: 27723
I'm guessing you are after an expression along these lines:
[0-9]+\s+0\s+obj\b.*?\/Contents\s+\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]
in s (DOTALL) mode, for example:
$re = '/[0-9]+\s+0\s+obj\b.*?\/Contents\s+\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]/s';
$str = '8 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 6 0 R
/Contents [ 5 0 R ]
>>
endobj
8 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 6 0 R
/Contents [ 5 0 R ]
>>
endobj';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
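Since the question reports PREG_BACKTRACK_LIMIT_ERROR, it may also help to check the error state around the call. A plausible reading is that RE 1 can start matching at the first object in the file, so the lazy .* has to step through everything up to /Contents, and in a real PDF that can exceed pcre.backtrack_limit. The sketch below is only an illustration under that assumption: the $pdf sample string stands in for the real file, and the raised limit of 10000000 is an arbitrary value, not a recommendation.

```php
<?php
// Hypothetical sample standing in for file_get_contents('myfile.pdf').
$pdf = "8 0 obj\n<<\n/Type /Page\n/Parent 2 0 R\n/Resources 6 0 R\n/Contents [ 5 0 R ]\n>>\nendobj\n";

$re = '/[0-9]+\s+0\s+obj\b.*?\/Contents\s+\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]/s';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
$result = preg_match($re, $pdf, $m);

if ($result === false && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
    // Assumption: raising the limit is acceptable for this input.
    // The value is arbitrary and chosen only for illustration.
    ini_set('pcre.backtrack_limit', '10000000');
    $result = preg_match($re, $pdf, $m);
}

if ($result === 1) {
    echo $m[1], "\n"; // the captured object reference
} elseif ($result === false) {
    echo "regex error code: ", preg_last_error(), "\n";
}
```

Raising the limit only treats the symptom. If the input can be large, a pattern that does not need to scan across the file is usually the safer choice.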
The expression is explained in the top-right panel of regex101.com if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs.
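As one possible simplification, if the object header (the "8 0 obj" part) is not actually needed, the pattern can anchor on /Contents alone. Without the .*? part there is nothing that has to scan across the file, so backtracking stays local to each candidate position. This is a minimal sketch, assuming only the referenced object numbers matter; the $str sample is a stand-in for the real PDF contents.

```php
<?php
// Hypothetical sample standing in for the PDF contents.
$str = "8 0 obj\n<<\n/Type /Page\n/Parent 2 0 R\n/Resources 6 0 R\n/Contents [ 5 0 R ]\n>>\nendobj\n";

// No ".*" here: the pattern starts directly at the /Contents key.
$re = '/\/Contents\s*\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]/';

preg_match_all($re, $str, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match[1], "\n"; // object and generation number of the content stream
}
```

The trade-off is that this no longer tells you which page object the /Contents entry belongs to.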
Upvotes: 1