Reputation: 1787
I am trying to extract the City element from a string in the following format:
<BR>Address 1<BR>Address 2<BR>City<BR>A1A 0A0<BR>Phone Number <BR>
OR
<BR>Address 1<BR>Address 2<BR>Address 3<BR>City<BR>A1A 0A0<BR>Phone Number <BR>
The input string can have any number of Address items before the city.
So far, my strategy is to select the postal code (A1A 0A0) and then extract the previous record, using <BR> as the marker.
Currently I am using:
<BR>(.*)<BR>[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]
$1
where $1 returns the first capture group of the regex in the tool I am using (Visual Web Ripper). However, the expression returns everything before the postal code.
So is there a way to make the regex non-greedy so that it selects only the occurrence immediately before the postal code?
Upvotes: 1
Views: 60
Reputation: 539
Took me a bit to get it but here:
[^>]*<BR>[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]
Edit: If you want to add capturing or non-capturing group you can do the following:
Non-capturing for the <BR> and Postal Code:
[^>]*(?:<BR>[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9])
Capturing for just the city:
([^>]*)<BR>[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]
Edit 2:
As per the comments below: this will only work if the name of the city does not contain the ">" character.
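For anyone who wants to try this outside the tool, here is a minimal sketch in Python's `re` module. Python is my assumption here, not what Visual Web Ripper uses, and the function names are made up for illustration. It also shows why simply making the original `(.*)` lazy does not help: the match still starts at the leftmost <BR>, so the lazy group grows until the postal code is reached.

```python
import re

# Postal code pattern copied from the question.
POSTAL = (r"[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] "
          r"[0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]")

SAMPLE = "<BR>Address 1<BR>Address 2<BR>City<BR>A1A 0A0<BR>Phone Number <BR>"


def city_negated_class(text):
    """Capture the run of non-'>' characters right before <BR> + postal code."""
    match = re.search(r"([^>]*)<BR>" + POSTAL, text)
    return match.group(1) if match else None


def city_lazy_only(text):
    """The question's pattern with (.*) swapped for a lazy (.*?)."""
    match = re.search(r"<BR>(.*?)<BR>" + POSTAL, text)
    return match.group(1) if match else None


print(city_negated_class(SAMPLE))  # City
print(city_lazy_only(SAMPLE))      # Address 1<BR>Address 2<BR>City
```

The negated class works because `[^>]*` cannot cross the ">" that closes the previous <BR>, so the capture is confined to a single field.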
Upvotes: 1
Reputation: 20486
So bear with me on this one, but this is how I got it to work:
(?:<BR>(.*?))+<BR>[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]
Explanation:
(?:        # Start a non-capturing group (so that we don't have unnecessary matches)
  <BR>     # Look for a <BR> to start the group
  (.*?)    # Then lazily match 0+ characters (lazy will stop us at the next match)
)+         # End the group and repeat it 1+ times (each field)
<BR>       # Look for one final <BR> right before the Zip Code
[...]      # I didn't feel like including the Zip Code logic you wrote :)
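The reason $1 ends up holding the city is that a repeated capture group keeps only its last repetition. Here is a small sketch to check that, written in Python's `re` module as an assumption (your tool may use a different engine, and the helper name is hypothetical):

```python
import re

# Postal code pattern copied from the question.
POSTAL = (r"[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] "
          r"[0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]")

# Each repetition of (?:<BR>(.*?)) consumes one field; group 1 is
# overwritten on every repetition, so it ends up with the last field
# before the final <BR> and the postal code.
PATTERN = re.compile(r"(?:<BR>(.*?))+<BR>" + POSTAL)


def last_field_before_postal(text):
    """Return the field immediately before the postal code, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


two = "<BR>Address 1<BR>Address 2<BR>City<BR>A1A 0A0<BR>Phone Number <BR>"
three = "<BR>Address 1<BR>Address 2<BR>Address 3<BR>City<BR>A1A 0A0<BR>Phone Number <BR>"

print(last_field_before_postal(two))
print(last_field_before_postal(three))
```

Both calls should give the city field regardless of how many address fields come first.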
However, depending on your language, I would recommend splitting the string and looping through it. Example in PHP:
$pieces = explode('<BR>', '<BR>Address 1<BR>Address 2<BR>Address 3<BR>City<BR>A1A 0A0<BR>Phone Number<BR>');
$count = count($pieces);
$city = null;
// Start at 1 so that $pieces[$i - 1] is always a valid index.
for ($i = 1; $i < $count; $i++) {
    // The field just before the first postal code match is the city.
    if (preg_match('/[ABCEFGHJKLMNPRSTVXY][0-9][ABCEFGHJKLMNPRSTVWXYZ] [0-9][ABCEFGHJKLMNPRSTVWXYZ][0-9]/', $pieces[$i])) {
        $city = $pieces[$i - 1];
        break;
    }
}
var_dump($city);
// string(4) "City"
Upvotes: 2