Reputation: 23
I've stumbled across what looks like an interesting bug in PHP. I have a regular expression (shown below) that works fine in one script (Script A) but fails to match when it is moved into a class and used from another script (Script B).
I have tested this on PHP 5.3 and 5.2.
Script A:
http://iamdb.googlecode.com/svn/trunk/testing.php
Script B:
Class the regex is used in: http://iamdb.googlecode.com/svn/trunk/imdb/search/imdb_search_title.class.php
Script calling it: http://iamdb.googlecode.com/svn/trunk/examples/Search_Debug.php
Regular Expression:
"#<br> aka <em>\"([^\"]*)\"</em>(?: -?,? ([^ ]*) (?:<em>\(([^\)]*)\)</em>)*)*#i"
Thanks.
As requested, here is some example output from Script B...
Array
(
    [0] => Array
        (
        )

    [1] => Array
        (
        )

    [2] => Array
        (
        )

    [3] => Array
        (
        )

    [INPUT] => <small>(TV series)</small> <br>aka <em>"Hammer Time"</em> - USA <em>(working title)</em>
)
The numbered keys are from the preg_match_all call and the INPUT key is added afterwards to show the input string.
Upvotes: 0
Views: 196
Reputation: 75242
Are you trying to match against an actual search-result page on IMDB, like this one? On that page, the "<br>" and the "aka" are always separated by an entity reference for a non-breaking space:
<br> aka <em>
I don't know if it's always that way; you might want to allow for multiple kinds and representations of whitespace, like this:
<br>(?:&(?:#(?:160|xA0)|nbsp);|\xA0|\s)*+aka
i.e., zero or more of: an entity reference for an NBSP (decimal, hexadecimal or named); a real NBSP; or a standard whitespace character.
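To make that concrete, here is a minimal sketch that runs the separator fragment against each representation. The sample subjects, the capture group and the variable names are made up for illustration. It also assumes the subject is UTF-8, so the "u" modifier is added to make \xA0 mean the code point U+00A0 rather than a single byte, and it uses "~" as the delimiter because the fragment itself contains "#".

```php
<?php
// Separator fragment from the answer above, unchanged.
$separator = '(?:&(?:#(?:160|xA0)|nbsp);|\xA0|\s)*+';

// Assumption: "~" delimiter (the fragment contains "#") and the "u" modifier
// (the subject is treated as UTF-8, so \xA0 matches U+00A0).
$pattern = '~<br>' . $separator . 'aka <em>"([^"]*)"</em>~iu';

// Hypothetical sample subjects, one per separator style.
$subjects = array(
    'none'        => '<br>aka <em>"Hammer Time"</em>',
    'space'       => '<br> aka <em>"Hammer Time"</em>',
    'named'       => '<br>&nbsp;aka <em>"Hammer Time"</em>',
    'decimal'     => '<br>&#160;aka <em>"Hammer Time"</em>',
    'hexadecimal' => '<br>&#xA0;aka <em>"Hammer Time"</em>',
    // "\xC2\xA0" is the UTF-8 encoding of a real non-breaking space.
    'raw NBSP'    => "<br>\xC2\xA0aka <em>\"Hammer Time\"</em>",
);

// Collect the captured title for each subject, or null if there is no match.
$titles = array();
foreach ($subjects as $label => $subject) {
    $titles[$label] = preg_match($pattern, $subject, $m) ? $m[1] : null;
}

print_r($titles);
```

If the page is served in a single-byte encoding such as ISO-8859-1 instead, the "u" modifier would have to go, since a lone 0xA0 byte is not valid UTF-8.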
Upvotes: 0
Reputation: 124325
There's nothing wrong with the regex or with embedding it in a class. You're convincing yourself that your test situations are equivalent when they're not. In the immediate case, the string you're sending to the class version,
<small>(TV series)</small> <br>aka <em>"Hammer Time"</em> - USA <em>(working title)</em>
isn't matched by the regex because the regex requires exactly one space between the <br> and the aka. This revision of it works:
const REGEX_AKA = "#<br>\s*aka <em>\"([^\"]*)\"</em>(?: (?:-?)(?:,?) ([^ ]*) (?:<em>\(([^\)]*)\)</em>)*)*#i";
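As a quick check, here is a small sketch that puts the revised pattern in a class constant and runs it on the input from the question. The class name AkaDemo and the parse() method are hypothetical; only the pattern and the input string come from this thread. The pattern is written in single quotes here so the double quotes need no escaping.

```php
<?php
// Hypothetical wrapper class, for illustration only.
class AkaDemo
{
    // Revised pattern: \s* allows zero or more whitespace characters after <br>.
    const REGEX_AKA = '#<br>\s*aka <em>"([^"]*)"</em>(?: (?:-?)(?:,?) ([^ ]*) (?:<em>\(([^\)]*)\)</em>)*)*#i';

    public static function parse($html)
    {
        // PREG_SET_ORDER groups the captures per match instead of per group.
        preg_match_all(self::REGEX_AKA, $html, $matches, PREG_SET_ORDER);
        return $matches;
    }
}

// Input string as shown in the question's debug output.
$input = '<small>(TV series)</small> <br>aka <em>"Hammer Time"</em> - USA <em>(working title)</em>';

$result = AkaDemo::parse($input);
print_r($result);
```

With this input there should be a single match whose captures are the title, the country and the note in parentheses.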
Upvotes: 1
Reputation: 761
Looking at the debugger, the subjects of the preg_match_all calls don't match between the class and the test.php case.
From the test case:
<small>(TV series)</small> <br> aka <em>"Sledge Hammer: The Early Years"</em> - USA <em>(second season title)</em>
The actual subject when called from the class:
<small>(TV series)</small> <br>aka <em>"Hammer Time"</em> - USA <em>(working title)</em>
There's no space between the <br> and the aka. Take that space out of the regex and it works.
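If you don't have a debugger handy, a rough sketch like the following shows the same thing: it runs the original pattern from the question against both subjects and dumps the bytes that follow the <br>, so a missing or unusual space becomes visible. The variable names are made up for the example.

```php
<?php
// Original pattern from the question (single-quoted, so quotes need no escaping).
$original = '#<br> aka <em>"([^"]*)"</em>(?: -?,? ([^ ]*) (?:<em>\(([^\)]*)\)</em>)*)*#i';

// The two subjects quoted above.
$subjects = array(
    'test case' => '<small>(TV series)</small> <br> aka <em>"Sledge Hammer: The Early Years"</em> - USA <em>(second season title)</em>',
    'class'     => '<small>(TV series)</small> <br>aka <em>"Hammer Time"</em> - USA <em>(working title)</em>',
);

$counts = array();
foreach ($subjects as $label => $subject) {
    // preg_match_all returns the number of full matches found.
    $counts[$label] = preg_match_all($original, $subject, $matches);

    // Hex-dump the four bytes right after "<br>" to expose the separator.
    $after = substr($subject, strpos($subject, '<br>') + 4, 4);
    printf("%-9s matches=%d  bytes after <br>: %s\n", $label, $counts[$label], bin2hex($after));
}
```

A plain space shows up as 20 in the hex dump; anything else between the <br> and the aka is what the pattern has to account for.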
Upvotes: 2