Reputation: 62384
preg_match_all('/<h3>(.+?)<span>[a-z]*<\/span><\/h3>[\t\s+]*<ul class="cities">(.+?)<\/ul>[\t\s+]*<div class="clear"><\/div>/is',
Below is my attempt to make <h3>(.+?)<span>
ungreedy according to the cheatsheat:
preg_match_all('/<h3>(.+??)<span>[a-z]*<\/span><\/h3>[\t\s+]*<ul class="cities">(.+?)<\/ul>[\t\s+]*<div class="clear"><\/div>/is',
Cheatsheet: http://www.addedbytes.com/download/regular-expressions-cheat-sheet-v1/png/
I'm still looking into solutions, but any advice you have would be great
Currently it's showing
Array
(
[0] => Alabama
[1] => Arizona
[2] => Arkansas
[3] => California
[4] => Colorado
[5] => Connecticut
[6] => Delaware
[7] => District of Columbia<span>District of Columbia</span></h3>
<ul class="cities">
<li>
Upvotes: 0
Views: 2367
Reputation: 75232
As @MattBall pointed out, (.+?)
is already non-greedy. However, non-greedy quantifiers, like greedy quantifiers, will match as much as they have to in order to achieve an overall match. In your regex, the (.+?)
originally stops at the first <span>
tag, but then <span>[a-z]*<\/span>
can't match because (as @jswolf19 pointed out) "District of Columbia"
contains more than just letters.
When that happens, the regex engine backtracks. The (.+?)
consumes the <span>
tag and keeps going until it reaches the next <span>
tag, which apparently is in the entry for the next state. You could change the [a-z]*
to [a-z\s]*
so it will match spaces too, but I recommend you use [^<]*
("zero or more of any characters except <
") instead.
There are some other, minor problems with your regex as well. Here's how I would write it:
'~<h3>([^<]+)<span>([^<]*)</span></h3>\s*<ul class="cities">(.+?)</ul>\s*<div class="clear"></div>~is'
Explanation:
/
to ~
so you don't have to escape slashes in the regex.(.+?)
to ([^<]+)
makes it more reliable (it can never match too much, as it was doing before) and it expresses your intent more clearly ("grab everything up the beginning of the next tag"). [\t\s+]*
matches zero or more of any whitespace character OR a plus sign, which I don't think you meant. The +
loses its special meaning inside a character class, and \s
includes \t
, so that's redundant. I changed that to \s*
.Finally, please don't put your faith in that cheat sheet. It's just a random assemblage of elements drawn from various regex flavors, with several errors added. For example, it says \<
and \>
match word boundaries in one place, and elsewhere it says you have to escape <
and >
with backslashes to match them literally. Not only do those two items contradict each other, neither of them applies to PHP (or to most other flavors, for that matter).
You'll be much better off relying on flavor-specific resources like the PHP docs and the PCRE manpage. There's also a much more reliable, flavor-neutral resource located here.
Upvotes: 2
Reputation: 25489
That would be .*?
, instead of .+?
, I believe
This would make your regex
/<h3>(.*?)<span>[a-z\s]*<\/span><\/h3>[\t\s+]*<ul class="cities">(.+?)<\/ul>[\t\s+]*<div class="clear"><\/div>/is
Note that I added a whitespace character matching to the regex in the <span>
tag
Upvotes: 0
Reputation: 31579
You try to make the wrong parts ungreedy, try this:
/<h3>(.+?)<span>[a-z]*?<\/span><\/h3>[\t\s+]*?<ul class="cities">(.+?)<\/ul>[\t\s+]*?<div class="clear"><\/div>/is
Upvotes: 0
Reputation: 2303
Your problem is with <span>[a-z]*<\/span><\/h3>
. There are spaces in "District of Colombia" that you're not matching.
Upvotes: 1