Reputation: 903
I have to create a loop and use a regexp to populate any of these 4 variables:
$address, $street, $town, $lot
The loop will be fed a string that may contain info like the lines below:
'123 any street, mytown'
or 'Lot 4 another road, thattown'
or 'Lot 2 96 other road, her town'
or 'this ave, this town'
or 'yourtown'
Since anything after a comma is the $town, I thought of
(.*), (.*)
Then the first capture could be checked with (Lot \d*) (.*), (.*)
If the 1st capture starts with a number, then it's the $address (if it's words with white space, it's the $street).
If it's one word, it's just the $town.
Upvotes: 1
Views: 1601
Reputation: 515
Geo::StreetAddress::US is fine for simple addresses, but it can lose context on harder examples. It parses the street name up until it finds a suburb. So with "46 7th St. Johns Park", 'St.' is consumed too soon, the street type gets incorrectly assigned to 'Park', and the state of 'CA' becomes the suburb.
Input address                        | Parsed output
2 Smith St Suburb NJ 12345           | 2 Smith St Suburb NJ 12345
25 MIRROR LAKE DR LITTLE EGG HARBOR  | 25 MIRROR LAKE DR Hbr NJ 0
74B Old Bohema Rd N, St. Johns Park  | 74 B Old Bohema Rd St Johns Park CA 95472
74 Mt Baw Baw Rd Suite C Some Park C | 74 Mt Baw Baw Rd S Park CA 0
74 Old Bohema Rd Bldg A Some Park CA | 74 Old Bohema Rd B Park CA 0
74 Old Bohema Rd Rm 123A Some Park C | 74 Old Bohema Rd R Park CA 0
Lot 74 Old Bohema Rd Some Park CA 95 | 0 Old Bohema Rd S Park CA 0
22 Glen Alpine Way Some Park CA 9547 | 22 Glen Alpine Way Park CA 0
4/6 Bohema Rd, St. Johns Park CA 954 | 4 6 Bohema Rd St Johns Park CA 95472
46 The Parade, St. Johns Park CA 954 | 46 The Parade 0
46 7th St. Johns Park CA 95472       | 46 7th St Johns Park CA 0
46 B Avenue Johns Park CA 95472      | 46 B Avenue Johns Park CA 0
46 Avenue C Johns Park CA 95472      | 46 Avenue C Johns Park CA 0
46 Broadway Johns Park CA 95472      | 46 Broadway Johns Park CA 0
46 State Route 19 Johns Park CA 9547 | 46 State Route 19 Park CA 0
46 John F Kennedy Drive Johns Park C | 46 John F Kennedy Park CA 0
PO Box 213 Somewhere IO 1234         | 0 Somewhere IO 0
1 BEACH DR SE # 2410 ST PETERSBURG F | 1 BEACH DR SE # 2 St PETERSBURG FL 33701
# 123 12 BEACH DR SE ST PETERSBURG F | 12 BEACH DR SE St PETERSBURG FL 33701
46 Broad Street #12 Suburb CA 95472  | 46 Broad St 0
I have developed a Perl module that can identify many of these more difficult patterns: https://metacpan.org/release/Lingua-EN-AddressParse . It recognizes idioms such as "The Parade" and nth Street, sub-property addresses such as "46 Broad Street #12", and many more.
Upvotes: 0
Reputation: 118128
Take a look at Geo::StreetAddress::US if these are U.S. addresses.
Even if they are not, the source of this module should give you an idea of what is involved in parsing free form street addresses.
Here is a script that handles the addresses you posted (updated: an earlier version combined the lot and number into one string):
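#!/usr/bin/perl
use strict; use warnings;

# Paragraph mode: records in DATA are separated by blank lines.
local $/ = "";

my @addresses;

while ( my $address = <DATA> ) {
    chomp $address;
    $address =~ s/\s+/ /g;    # collapse newlines and runs of spaces

    my (%address, $rest);

    # Split on the last comma: reverse the string, split once,
    # then reverse both pieces back.
    ($address{town}, $rest) = map { scalar reverse }
        split( / ?, ?/, reverse($address), 2 );

    {
        # $rest is undef when the record has no comma (town only).
        no warnings 'uninitialized';
        # [0-9]+ so that multi-digit lots such as "Lot 12" also match.
        @address{qw(lot number street)} =
            $rest =~ /^(?:(Lot [0-9]+) )?(?:([0-9]+) )?(.+)\z/;
    }

    push @addresses, \%address;
}

use Data::Dumper;
print Dumper \@addresses;

__DATA__
123 any street,
mytown

Lot 4 another road,
thattown

Lot 2 96 other road,
her town

yourtown

street,
town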
Output:
$VAR1 = [
  { 'lot' => undef, 'number' => '123', 'street' => 'any street', 'town' => 'mytown' },
  { 'lot' => 'Lot 4', 'number' => undef, 'street' => 'another road', 'town' => 'thattown' },
  { 'lot' => 'Lot 2', 'number' => '96', 'street' => 'other road', 'town' => 'her town' },
  { 'lot' => undef, 'number' => undef, 'street' => undef, 'town' => 'yourtown' },
  { 'lot' => undef, 'number' => undef, 'street' => 'street', 'town' => 'town' }
];
Upvotes: 7
Reputation: 304147
This should separate into 3 parts - how do you distinguish the address/street?
(Lot \d*)? ?([^,]*,)? ?(.*)
Here is the breakdown for your examples:
('', '123 any street,', 'mytown')
('Lot 4', 'another road,', 'thattown')
('Lot 2', '96 other road,', 'her town')
('', 'this ave,', 'this town')
('', '', 'yourtown')
If I understand correctly, this one separates the address/street as well:
(Lot \d*)? ?(\d*) ?([^,]*,)? ?(.*)
('', '123', 'any street,', 'mytown')
('Lot 4', '', 'another road,', 'thattown')
('Lot 2', '96', 'other road,', 'her town')
('', '', 'this ave,', 'this town')
('', '', '', 'yourtown')
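For what it's worth, here is a rough sketch of how the four-group pattern could be driven from a Perl loop. The `split_address` helper, the `^`/`$` anchors, and the stripping of the trailing comma are my own assumptions, not part of the pattern above. In Perl, optional groups that do not take part in the match come back as undef, so the sketch maps them to empty strings to mirror the tuples shown.

```perl
use strict;
use warnings;

# The four-group pattern from above; the ^ and $ anchors are an addition.
my $pattern = qr/^(Lot \d*)? ?(\d*) ?([^,]*,)? ?(.*)$/;

# Hypothetical helper: returns (lot, number, street, town) as strings.
sub split_address {
    my ($line) = @_;
    my @groups = $line =~ $pattern or return;
    # Optional groups that did not take part in the match are undef.
    @groups = map { defined $_ ? $_ : '' } @groups;
    $groups[2] =~ s/,$//;    # drop the comma captured with the street
    return @groups;
}

for my $line (
    '123 any street, mytown',
    'Lot 4 another road, thattown',
    'Lot 2 96 other road, her town',
    'this ave, this town',
    'yourtown',
) {
    my ( $lot, $number, $street, $town ) = split_address($line);
    print "lot='$lot' number='$number' street='$street' town='$town'\n";
}
```

Note that `\d*` also accepts zero digits, so a string starting with "Lot " followed by a word would still fill the first group; tightening it to `\d+` may be worth considering.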
Upvotes: 1
Reputation: 10340
I can't match the last one, but for the first three you can use something like this:
if (preg_match('/(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)/m', $subject, $regs)) {
    $result = $regs[1];
} else {
    $result = "";
}
This is the regex being tested:
(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)
You can test it in RegexBuddy.
Upvotes: 0
Reputation: 3891
I'd suggest you don't try to do all of this in a single regexp as it will be hard to verify its correctness.
First, I'd split at the comma. Whatever comes after the comma is the $town, and if there is no comma, the whole string is the $town.
Then I'd check if there is any lot information and extract it from the string.
Then I'd look for street/avenue number and name.
Divide and conquer :)
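The steps above could be sketched roughly like this in Perl. This is only one possible reading: the `parse_address` name, the hash keys, splitting at the last comma, and the case-insensitive "Lot" match are all assumptions on my part.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one free-form line in three small steps.
sub parse_address {
    my ($line) = @_;
    my %parts = ( lot => undef, number => undef, street => undef, town => undef );

    $line =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace

    # Step 1: split at the (last) comma. No comma means the whole
    # string is the town and there is nothing else to parse.
    my $rest;
    if ( $line =~ /^(.*),\s*(.*)$/ ) {
        ( $rest, $parts{town} ) = ( $1, $2 );
    }
    else {
        $parts{town} = $line;
        return \%parts;
    }
    $rest =~ s/\s+$//;

    # Step 2: peel off leading lot information, if any.
    if ( $rest =~ s/^(Lot\s+\d+)\b\s*//i ) {
        $parts{lot} = $1;
    }

    # Step 3: a leading number is the street number; what is left
    # over is treated as the street name.
    if ( $rest =~ s/^(\d+)\s+// ) {
        $parts{number} = $1;
    }
    $parts{street} = $rest if length $rest;

    return \%parts;
}

for my $line ( '123 any street, mytown', 'Lot 2 96 other road, her town', 'yourtown' ) {
    my $p = parse_address($line);
    print join( ', ',
        map { "$_=" . ( defined $p->{$_} ? $p->{$_} : 'undef' ) }
            qw(lot number street town) ), "\n";
}
```

Each step only has to handle one concern, so each small regexp can be checked separately against the sample strings.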
Upvotes: 7