magdmartin

Reputation: 1787

Regex select previous occurrence

I am trying to extract the City element from a string that has the following format:

<BR>Address 1<BR>Address 2<BR>City<BR>A1A 0A0<BR>Phone Number <BR>

OR

<BR>Address 1<BR>Address 2<BR>Address 3<BR>City<BR>A1A 0A0<BR>Phone Number <BR>

The input string can have an arbitrary number of Address items before the city.

So far, my strategy is to select the postal code (A1A 0A0) and then extract the previous record using <BR> as marker.

Currently I am using this pattern and replacement:

<BR>(.*)<BR>[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]
$1

Where $1 returns the first group of the regex in the tool I am using (Visual Web Ripper). However, the expression returns everything before the postal code, not just the city.
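
To reproduce this outside the tool, here is a minimal sketch in Python (Python's `re` module is only an assumption for illustration; Visual Web Ripper may use a different regex engine). It also tries the lazy form `.*?`, which gives the same result because the match still begins at the first <BR> in the string:

```python
import re

# Postal code pattern copied from the expression above.
POSTAL = (r"[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] "
          r"[0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]")

SAMPLE = "<BR>Address 1<BR>Address 2<BR>City<BR>A1A 0A0<BR>Phone Number <BR>"

# Greedy version: the group runs from the first <BR> up to the postal code.
greedy = re.search(r"<BR>(.*)<BR>" + POSTAL, SAMPLE)
print(greedy.group(1))  # Address 1<BR>Address 2<BR>City

# Lazy version: the search is still anchored at the leftmost <BR>,
# so the group has to grow until the postal code follows.
lazy = re.search(r"<BR>(.*?)<BR>" + POSTAL, SAMPLE)
print(lazy.group(1))    # Address 1<BR>Address 2<BR>City
```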

So is there a way to make the regex non-greedy so that it selects only the previous occurrence?

Upvotes: 1

Views: 60

Answers (2)

IronWilliamCash

Reputation: 539

Took me a bit to get it but here:

[^>]*<BR>[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]

Edit: If you want to add a capturing or non-capturing group, you can do the following:

Non-capturing for the <BR> and Postal Code:

[^>]*(?:<BR>[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9])

Capturing for just the city:

([^>]*)<BR>[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]

Edit 2:

As per the comments below: this will only work if the name of the city does not contain the ">" character.
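
A quick check of the capturing version, sketched in Python (an assumption for illustration only; the helper name `extract_city` is hypothetical and the asker's tool may behave differently). Because `[^>]*` cannot cross the ">" that closes a tag, the group can only hold the text between the last <BR> and the one in front of the postal code:

```python
import re

# Capturing pattern from the answer: city text, then <BR>, then postal code.
CITY_RE = re.compile(
    r"([^>]*)<BR>"
    r"[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] "
    r"[0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]"
)

def extract_city(text):
    """Return the field right before the postal code, or None if no match."""
    match = CITY_RE.search(text)
    return match.group(1) if match else None

samples = [
    "<BR>Address 1<BR>Address 2<BR>City<BR>A1A 0A0<BR>Phone Number <BR>",
    "<BR>Address 1<BR>Address 2<BR>Address 3<BR>City<BR>A1A 0A0<BR>Phone Number <BR>",
]

for sample in samples:
    print(extract_city(sample))  # City (for both samples)
```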

Upvotes: 1

Sam

Reputation: 20486

So bear with me on this one, but this is how I got it to work:

(?:<BR>(.*?))+<BR>[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]

Explanation:

(?:       # Start a non-capturing group (so that we don't have unnecessary matches)
  <BR>    # Look for a <BR> to start the group
  (.*?)   # Then lazily match 0+ characters (lazy will stop us at the next match)
)+        # End the group and repeat it 1+ times (each field)
<BR>      # Look for one final <BR> right before the Zip Code
[...]     # I didn't feel like including the Zip Code logic you wrote :)
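
To see which field ends up in the capture group, here is a small sketch in Python (an assumption; other engines may treat repeated groups differently). In Python's `re`, a group inside a repeated construct keeps only its last iteration, which here is the field right before the postal code:

```python
import re

# The full expression, with the postal code part spelled out.
PATTERN = re.compile(
    r"(?:<BR>(.*?))+<BR>"
    r"[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] "
    r"[0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]"
)

two_lines = "<BR>Address 1<BR>Address 2<BR>City<BR>A1A 0A0<BR>Phone Number <BR>"
three_lines = ("<BR>Address 1<BR>Address 2<BR>Address 3<BR>City"
               "<BR>A1A 0A0<BR>Phone Number <BR>")

# group(1) holds the last repetition of (.*?), not the first.
city_two = PATTERN.search(two_lines).group(1)
city_three = PATTERN.search(three_lines).group(1)

print(city_two)    # City
print(city_three)  # City
```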

However, depending on your language, I would recommend splitting the string and looping through it. Example in PHP:

$pieces = explode('<BR>', '<BR>Address 1<BR>Address 2<BR>Address 3<BR>City<BR>A1A 0A0<BR>Phone Number<BR>');
$count = count($pieces);

$city = null;
for($i = 1; $i < $count; $i++) {
    if(preg_match('/[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]/', $pieces[$i])) {
        $city = $pieces[$i - 1];
        break;
    }
}

var_dump($city);
// string(4) "City"

Upvotes: 2
