Ben
Ben

Reputation: 62384

Making this regex not greedy

preg_match_all('/<h3>(.+?)<span>[a-z]*<\/span><\/h3>[\t\s+]*<ul class="cities">(.+?)<\/ul>[\t\s+]*<div class="clear"><\/div>/is',

Below is my attempt to make <h3>(.+?)<span> ungreedy according to the cheatsheat:

preg_match_all('/<h3>(.+??)<span>[a-z]*<\/span><\/h3>[\t\s+]*<ul class="cities">(.+?)<\/ul>[\t\s+]*<div class="clear"><\/div>/is',

Cheatsheet: http://www.addedbytes.com/download/regular-expressions-cheat-sheet-v1/png/

I'm still looking into solutions, but any advice you have would be great

Currently it's showing

Array
(
    [0] => Alabama
    [1] => Arizona
    [2] => Arkansas
    [3] => California
    [4] => Colorado
    [5] => Connecticut
    [6] => Delaware
    [7] => District of Columbia<span>District of Columbia</span></h3>
                      <ul class="cities">
                        <li>

Upvotes: 0

Views: 2367

Answers (4)

Alan Moore
Alan Moore

Reputation: 75232

As @MattBall pointed out, (.+?) is already non-greedy. However, non-greedy quantifiers, like greedy quantifiers, will match as much as they have to in order to achieve an overall match. In your regex, the (.+?) originally stops at the first <span> tag, but then <span>[a-z]*<\/span> can't match because (as @jswolf19 pointed out) "District of Columbia" contains more than just letters.

When that happens, the regex engine backtracks. The (.+?) consumes the <span> tag and keeps going until it reaches the next <span> tag, which apparently is in the entry for the next state. You could change the [a-z]* to [a-z\s]* so it will match spaces too, but I recommend you use [^<]* ("zero or more of any characters except <") instead.

There are some other, minor problems with your regex as well. Here's how I would write it:

'~<h3>([^<]+)<span>([^<]*)</span></h3>\s*<ul class="cities">(.+?)</ul>\s*<div class="clear"></div>~is'

Explanation:

  • I changed the regex delimiter from / to ~ so you don't have to escape slashes in the regex.
  • Changing the first (.+?) to ([^<]+) makes it more reliable (it can never match too much, as it was doing before) and it expresses your intent more clearly ("grab everything up the beginning of the next tag").
  • Your [\t\s+]* matches zero or more of any whitespace character OR a plus sign, which I don't think you meant. The + loses its special meaning inside a character class, and \s includes \t, so that's redundant. I changed that to \s*.

Finally, please don't put your faith in that cheat sheet. It's just a random assemblage of elements drawn from various regex flavors, with several errors added. For example, it says \< and \> match word boundaries in one place, and elsewhere it says you have to escape < and > with backslashes to match them literally. Not only do those two items contradict each other, neither of them applies to PHP (or to most other flavors, for that matter).

You'll be much better off relying on flavor-specific resources like the PHP docs and the PCRE manpage. There's also a much more reliable, flavor-neutral resource located here.

Upvotes: 2

pho
pho

Reputation: 25489

That would be .*?, instead of .+?, I believe

This would make your regex

/<h3>(.*?)<span>[a-z\s]*<\/span><\/h3>[\t\s+]*<ul class="cities">(.+?)<\/ul>[\t\s+]*<div class="clear"><\/div>/is

Note that I added a whitespace character matching to the regex in the <span> tag

Upvotes: 0

Daniel
Daniel

Reputation: 31579

You try to make the wrong parts ungreedy, try this:

/<h3>(.+?)<span>[a-z]*?<\/span><\/h3>[\t\s+]*?<ul class="cities">(.+?)<\/ul>[\t\s+]*?<div class="clear"><\/div>/is

Upvotes: 0

jswolf19
jswolf19

Reputation: 2303

Your problem is with <span>[a-z]*<\/span><\/h3>. There are spaces in "District of Colombia" that you're not matching.

Upvotes: 1

Related Questions