Reputation: 137
I am using a regex to get URLs from a webpage.
On localhost (PHP 5.3.15 with Suhosin-Patch (cli) (built: Aug 24 2012 17:45:44)) code:
$file = file_get_contents("http://www.etech.haw-hamburg.de/Stundenplan/");
$pattern = "/<a href=\"([^\"]*.pdf)\">(.*)<\/a>/iU";
preg_match_all($pattern, $file, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
gives:
=> Array
(
[0] => Sem_IuE_E1a.pdf
[1] => Sem_IuE_E2a.pdf
[2] => Sem_IuE_E3a.pdf
[3] => Sem_IuE_E4a.pdf
[4] => Sem_IuE_E6AT.pdf
[5] => Sem_IuE_E7.pdf
[6] => Sem_IuE_E1b.pdf
[7] => Sem_IuE_E2b.pdf
[8] => Sem_IuE_E3b.pdf
[9] => Sem_IuE_E4b.pdf
[10] => Sem_IuE_E6II.pdf
[11] => Sem_IuE_E6KT.pdf
[12] => Sem_IuE_BMT1.pdf
[13] => Laborplan%20BMT1%20KoP%201.pdf
[14] => Sem_IuE_BMT2.pdf
[15] => Sem_IuE_BMT3.pdf
[16] => Sem_IuE_BMT4.pdf
[17] => Sem_IuE_BMT5.pdf
[18] => Sem_IuE_BMT6.pdf
[19] => Sem_IuE_IE2.pdf
[20] => Sem_IuE_IE4.pdf
[21] => Sem_IuE_IE6.pdf
[22] => Sem_IuE_AM.pdf
[23] => Sem_IuE_IKM1.pdf
[24] => Legende_Stud.pdf
[25] => Kalender.pdf
[26] => Doz.pdf
[27] => Doz.pdf
)
while on the remote server (PHP 5.3.3 (cli) (built: Feb 22 2013 02:51:11)), the same code gives:
=> Array
(
[0] => Sem_IuE_E2a.pdf
[1] => Sem_IuE_E7.pdf
[2] => Sem_IuE_E1b.pdf
[3] => Sem_IuE_E2b.pdf
[4] => Sem_IuE_E3b.pdf
[5] => Sem_IuE_E6II.pdf
[6] => Sem_IuE_E6KT.pdf
[7] => Sem_IuE_BMT1.pdf
[8] => Laborplan%20BMT1%20KoP%201.pdf
[9] => Sem_IuE_BMT2.pdf
[10] => Sem_IuE_BMT3.pdf
[11] => Sem_IuE_BMT4.pdf
[12] => Sem_IuE_BMT5.pdf
[13] => Sem_IuE_BMT6.pdf
[14] => Sem_IuE_IE2.pdf
[15] => Sem_IuE_IE4.pdf
[16] => Sem_IuE_IE6.pdf
[17] => Sem_IuE_AM.pdf
[18] => Doz.pdf
[19] => Doz.pdf
)
Why do the two servers return different results?
Upvotes: 1
Views: 463
Reputation: 7880
I've come up with a workaround. If you fetch the page, strip every tag except <a>, and then run the regex, you should get more consistent results. The target page appears to have been generated by a Microsoft application, and its markup is horrible.
<?php
$file = file_get_contents("http://www.etech.haw-hamburg.de/Stundenplan/");
$file = strip_tags($file,'<a>');
$pattern = "!\<a href=[\"']([^.]+\.pdf)[\"']\>([^\<]+)\<\/a\>!iU"; // no "|" inside [...]: a character class needs no alternation
preg_match_all($pattern, $file, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
?>
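To make the effect of the strip_tags() step concrete, here is a minimal sketch. The sample string is an assumption: a single anchor written in the style of the target page, not the live page itself.

```php
<?php
// Sample anchor in the style of the target page (an assumption, not fetched from the live site).
$html = '<p><a href="Sem_IuE_E4a.pdf"><span lang=IT style=\'mso-ansi-language:IT\'>E4a</span></a></p>';

// Keep only <a> elements; every other tag is dropped, but its text content stays.
$clean = strip_tags($html, '<a>');

echo $clean, "\n"; // prints: <a href="Sem_IuE_E4a.pdf">E4a</a>
```

After this step the link text no longer contains nested tags, so a simple pattern such as ([^\<]+) can match it.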
Upvotes: 1
Reputation: 14691
I don't have a precise answer, but your question mentions that the results differ between PHP 5.3.3 and PHP 5.3.15.
I took a look at the PHP 5 ChangeLog, where the answer probably lies, and saw the following possible explanations:
Upgraded bundled PCRE to version 8.11. (Ilia)
Upgraded bundled PCRE to version 8.12. (Scott)
I read the release notes for both PCRE versions, and I am not sure what could affect matching in your case, except for a few corrections mentioning UTF8 encoding.
But while looking at the U modifier, I noticed this entry in the PCRE Configuration Options:
pcre.backtrack_limit: PCRE's backtracking limit. Defaults to 100000 for PHP < 5.3.7.
My guess is that some fix in the U (PCRE_UNGREEDY) modifier changed the way the part between the <a> tags is matched. This makes sense because, looking at the source of the page you are scraping, the only links that match in the earlier PHP version are the <a> tags that don't contain inner HTML.
For example, this one matches:
<a href="Sem_IuE_E2a.pdf">E2a</a>
This one doesn't:
<a href="Sem_IuE_E4a.pdf"><span lang=IT style='mso-ansi-language:IT'>E4a</span></a>
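One way to test whether PCRE is giving up early on the older server is to inspect preg_last_error() after the call. The following is only a sketch: matchPdfLinks() is a hypothetical helper, the sample markup is the two anchors quoted above rather than the live page, and the pattern is the one from the question with the dot before "pdf" escaped.

```php
<?php
// Hypothetical helper: run the pattern and report whether PCRE stopped with an error.
function matchPdfLinks($pattern, $html)
{
    $count = preg_match_all($pattern, $html, $matches);
    $error = preg_last_error();

    if ($count === false || $error !== PREG_NO_ERROR) {
        $reason = ($error === PREG_BACKTRACK_LIMIT_ERROR)
            ? 'backtrack limit reached'
            : 'PCRE error code ' . $error;
        return array('ok' => false, 'reason' => $reason, 'files' => array());
    }

    return array('ok' => true, 'reason' => '', 'files' => $matches[1]);
}

// Show the limit that is in effect on this machine.
echo 'pcre.backtrack_limit = ' . ini_get('pcre.backtrack_limit') . "\n";

// Sample markup: the two anchors quoted above.
$html = '<a href="Sem_IuE_E2a.pdf">E2a</a>' . "\n"
      . '<a href="Sem_IuE_E4a.pdf"><span lang=IT style=\'mso-ansi-language:IT\'>E4a</span></a>';

// Pattern from the question, with the dot before "pdf" escaped.
$pattern = '/<a href="([^"]*\.pdf)">(.*)<\/a>/iU';

$result = matchPdfLinks($pattern, $html);
print_r($result);
```

Running the same helper against the real page on both servers would show whether the older one reports an error. If it reports the backtrack limit, raising it with ini_set('pcre.backtrack_limit', ...) before the call is one thing to try.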
Very interesting, but how to fix it?
I don't have access to an earlier PHP version, so I cannot test this, but I would drop the ungreedy (.*)<\/a> part of your regular expression. You don't need to match the content inside the <a></a> tags, since the value is already contained in the PDF filename:
$pattern = "/<a href=\"([^\"]*\.pdf)\">/i";
Or use a DOM parser.
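A minimal sketch of the DOM approach, using PHP's bundled DOMDocument class. The helper name extractPdfLinks() is hypothetical, and the sample markup is the two anchors quoted above rather than the live page; for the real page you would pass in the result of file_get_contents().

```php
<?php
// Hypothetical helper: collect href and link text for every <a> that points at a PDF.
function extractPdfLinks($html)
{
    // Real-world markup is often invalid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Case-insensitive check that the link ends in ".pdf".
        if (strtolower(substr($href, -4)) === '.pdf') {
            // textContent flattens nested tags such as <span> into plain text.
            $links[] = array('href' => $href, 'text' => trim($anchor->textContent));
        }
    }
    return $links;
}

// Sample markup: the two anchors quoted above.
$html = '<html><body>'
      . '<a href="Sem_IuE_E2a.pdf">E2a</a>'
      . '<a href="Sem_IuE_E4a.pdf"><span lang=IT style=\'mso-ansi-language:IT\'>E4a</span></a>'
      . '</body></html>';

$links = extractPdfLinks($html);
print_r($links);
```

Because the parser builds a tree, nested tags inside the link do not affect the result, and no backtracking is involved.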
Upvotes: 1