Herfox

Reputation: 137

PHP regular expressions work differently on different servers

I am using a regex to extract URLs from a webpage.

On localhost (PHP 5.3.15 with Suhosin-Patch (cli) (built: Aug 24 2012 17:45:44)), this code:

$file = file_get_contents("http://www.etech.haw-hamburg.de/Stundenplan/");
$pattern = "/<a href=\"([^\"]*.pdf)\">(.*)<\/a>/iU";
preg_match_all($pattern, $file, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";

gives:

=> Array
(
        [0] => Sem_IuE_E1a.pdf
        [1] => Sem_IuE_E2a.pdf
        [2] => Sem_IuE_E3a.pdf
        [3] => Sem_IuE_E4a.pdf
        [4] => Sem_IuE_E6AT.pdf
        [5] => Sem_IuE_E7.pdf
        [6] => Sem_IuE_E1b.pdf
        [7] => Sem_IuE_E2b.pdf
        [8] => Sem_IuE_E3b.pdf
        [9] => Sem_IuE_E4b.pdf
        [10] => Sem_IuE_E6II.pdf
        [11] => Sem_IuE_E6KT.pdf
        [12] => Sem_IuE_BMT1.pdf
        [13] => Laborplan%20BMT1%20KoP%201.pdf
        [14] => Sem_IuE_BMT2.pdf
        [15] => Sem_IuE_BMT3.pdf
        [16] => Sem_IuE_BMT4.pdf
        [17] => Sem_IuE_BMT5.pdf
        [18] => Sem_IuE_BMT6.pdf
        [19] => Sem_IuE_IE2.pdf
        [20] => Sem_IuE_IE4.pdf
        [21] => Sem_IuE_IE6.pdf
        [22] => Sem_IuE_AM.pdf
        [23] => Sem_IuE_IKM1.pdf
        [24] => Legende_Stud.pdf
        [25] => Kalender.pdf
        [26] => Doz.pdf
        [27] => Doz.pdf
    )

while on the remote server (PHP 5.3.3 (cli) (built: Feb 22 2013 02:51:11)), the same code gives:

=> Array
    (
        [0] => Sem_IuE_E2a.pdf
        [1] => Sem_IuE_E7.pdf
        [2] => Sem_IuE_E1b.pdf
        [3] => Sem_IuE_E2b.pdf
        [4] => Sem_IuE_E3b.pdf
        [5] => Sem_IuE_E6II.pdf
        [6] => Sem_IuE_E6KT.pdf
        [7] => Sem_IuE_BMT1.pdf
        [8] => Laborplan%20BMT1%20KoP%201.pdf
        [9] => Sem_IuE_BMT2.pdf
        [10] => Sem_IuE_BMT3.pdf
        [11] => Sem_IuE_BMT4.pdf
        [12] => Sem_IuE_BMT5.pdf
        [13] => Sem_IuE_BMT6.pdf
        [14] => Sem_IuE_IE2.pdf
        [15] => Sem_IuE_IE4.pdf
        [16] => Sem_IuE_IE6.pdf
        [17] => Sem_IuE_AM.pdf
        [18] => Doz.pdf
        [19] => Doz.pdf
    )

What is the problem?

Upvotes: 1

Views: 463

Answers (2)

AbsoluteƵERØ

Reputation: 7880

I've come up with a workaround. If you fetch the page, strip every tag except <a>, and then parse the result, you should get more consistent results. The HTML generated by Microsoft applications (which produced the target page) is horrible.

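<?php
$file = file_get_contents("http://www.etech.haw-hamburg.de/Stundenplan/");
// Keep only the <a> tags so nested markup such as <span> cannot break the match.
$file = strip_tags($file, '<a>');
// "!" is the delimiter, so "/" needs no escaping; the href may use either quote style.
$pattern = "!<a href=[\"']([^\"']+\.pdf)[\"']>([^<]+)</a>!iU";
preg_match_all($pattern, $file, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
?>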
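
To see why stripping helps, here is a minimal sketch of what strip_tags() does to a single anchor with nested markup. The sample string is an assumption standing in for the downloaded page; the variable names are illustrative only.

```php
<?php
// Sample anchor with nested <span> markup, standing in for the real page.
$anchor = "<a href=\"Sem_IuE_E4a.pdf\"><span lang=IT style='mso-ansi-language:IT'>E4a</span></a>";

// The second argument lists the tags to keep; everything else is removed,
// while the text inside the removed tags is preserved.
$stripped = strip_tags($anchor, '<a>');

echo $stripped, "\n";
```

After stripping, the link text sits directly inside the <a> element, so a pattern that forbids "<" in the link text can match it.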

Upvotes: 1

Tchoupi

Reputation: 14691

I don't have a precise answer, but your question mentions that you get different results on PHP 5.3.3 and PHP 5.3.15.

I took a look at the PHP 5 ChangeLog, where the answer probably lies, and found the following possible explanations:

Upgraded bundled PCRE to version 8.11. (Ilia)

Upgraded bundled PCRE to version 8.12. (Scott)

I read the release notes for both PCRE versions, and I am not sure what could affect matching in your case, apart from a few corrections related to UTF-8 encoding.

However, while reading about the U modifier, I noticed this in the PCRE configuration options (pcre.backtrack_limit):

PCRE's backtracking limit. Defaults to 100000 for PHP < 5.3.7.
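
One way to check whether this limit plays a role is to print the setting and inspect preg_last_error() after matching on each server. The following is only a sketch: preg_error_name() is a hypothetical helper, and the one-line $file is a stand-in for the downloaded page.

```php
<?php
// Hypothetical helper: translate a preg_last_error() code into its constant name.
function preg_error_name($code)
{
    $names = array(
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    );
    return isset($names[$code]) ? $names[$code] : 'unknown (' . $code . ')';
}

// Stand-in input (assumption); on the servers this would be the fetched HTML.
$file = '<a href="Sem_IuE_E2a.pdf">E2a</a>';
// Same pattern as in the question, written with single quotes.
$pattern = '/<a href="([^"]*.pdf)">(.*)<\/a>/iU';

echo 'pcre.backtrack_limit = ', ini_get('pcre.backtrack_limit'), "\n";

$count = preg_match_all($pattern, $file, $matches);
echo 'matches: ', var_export($count, true), "\n";
echo 'last error: ', preg_error_name(preg_last_error()), "\n";
```

If one server reports PREG_BACKTRACK_LIMIT_ERROR and the other does not, raising the limit with ini_set('pcre.backtrack_limit', ...) would be worth trying. If both report PREG_NO_ERROR, the limit is probably not the cause.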

My guess is that some fix related to the U (PCRE_UNGREEDY) modifier changed the way the part between the <a> tags is matched. This makes sense because, looking at the source of the page you are scraping, the only links that match in the earlier PHP version are the <a> tags that don't contain inner HTML.

For example, this one matches:

<a href="Sem_IuE_E2a.pdf">E2a</a>

This one doesn't:

<a href="Sem_IuE_E4a.pdf"><span lang=IT style='mso-ansi-language:IT'>E4a</span></a>

Very interesting, but how to fix it?

I don't have access to an earlier PHP version, so I cannot test this, but I would drop the (.*)<\/a> part of your regular expression. You don't need to match what is inside the <a></a> tags, since the value is already contained in the PDF filename:

$pattern = "/<a href=\"([^\"]*\.pdf)\">/i";

Or

Use a DOM Parser.
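
A DOM-based approach might look like the following sketch, which uses PHP's built-in DOMDocument. The function name extract_pdf_links() and the inline sample HTML are assumptions for illustration; on the real page you would pass in the result of file_get_contents().

```php
<?php
// Hypothetical helper (not from the original code): collect PDF links
// from an HTML string with DOMDocument instead of a regex.
function extract_pdf_links($html)
{
    $links = array();
    $doc = new DOMDocument();

    // Real-world pages are often invalid HTML; collect libxml warnings
    // quietly instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Keep only hrefs ending in ".pdf" (case-insensitive).
        if (preg_match('/\.pdf$/i', $href)) {
            $links[] = array(
                'href' => $href,
                'text' => trim($anchor->textContent),
            );
        }
    }
    return $links;
}

// Two sample anchors, one plain and one with nested markup.
$html = '<html><body>'
      . '<a href="Sem_IuE_E2a.pdf">E2a</a>'
      . "<a href=\"Sem_IuE_E4a.pdf\"><span lang=IT style='mso-ansi-language:IT'>E4a</span></a>"
      . '</body></html>';

print_r(extract_pdf_links($html));
```

Because the parser builds a tree, nested elements such as <span> do not affect whether a link is found, and textContent returns the link text without any inner tags.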

Upvotes: 1
