Sanju

Reputation: 903

Parse address with regex

I have to write a loop that uses a regexp to populate any of these 4 variables:

$address, $street, $town, $lot

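The loop will be fed a string that may contain info like the lines below:

123 any street, mytown
Lot 4 another road, thattown
Lot 2 96 other road, her town
this ave, this town
yourtown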

Since anything after a comma is the $town, I thought of starting with

(.*), (.*)

Then the first capture could be checked with (Lot \d*) (.*), (.*). If the first capture starts with a number, it's the $address; if it's words separated by white space, it's the $street; and if the whole string is a single word, it's just the $town.

Upvotes: 1

Views: 1601

Answers (5)

Kim Ryan

Reputation: 515

Geo::StreetAddress::US is fine for simple addresses, but it can lose context on harder examples. It parses the street name until it finds a suburb, so with "46 7th St. Johns Park" the 'St.' is consumed too soon, the street type is incorrectly assigned to 'Park', and the state 'CA' becomes the suburb. Sample results, with the (truncated) input on the left and the parsed fields on the right:

2 Smith St Suburb NJ 12345              2 Smith           St   Suburb          NJ 12345
25 MIRROR LAKE DR LITTLE EGG HARBOR    25 MIRROR LAKE DR  Hbr  NJ                     0
74B Old Bohema Rd N, St. Johns Park    74 B Old Bohema    Rd   St Johns Park   CA 95472
74 Mt Baw Baw Rd Suite C Some Park C   74 Mt Baw Baw Rd S Park CA                     0
74 Old Bohema Rd Bldg A Some Park CA   74 Old Bohema Rd B Park CA                     0
74 Old Bohema Rd Rm 123A Some Park C   74 Old Bohema Rd R Park CA                     0
Lot 74 Old Bohema Rd Some Park CA 95    0 Old Bohema Rd S Park CA                     0
22 Glen Alpine Way Some Park CA 9547   22 Glen Alpine Way Park CA                     0
4/6 Bohema Rd, St. Johns Park CA 954    4 6 Bohema        Rd   St Johns Park   CA 95472
46 The Parade, St. Johns Park CA 954   46 The                  Parade                 0
46 7th St. Johns Park CA 95472         46 7th St Johns    Park CA                     0
46 B Avenue Johns Park CA 95472        46 B Avenue Johns  Park CA                     0
46 Avenue C Johns Park CA 95472        46 Avenue C Johns  Park CA                     0
46 Broadway Johns Park CA 95472        46 Broadway Johns  Park CA                     0
46 State Route 19 Johns Park CA 9547   46 State Route 19  Park CA                     0
46 John F Kennedy Drive Johns Park C   46 John F Kennedy  Park CA                     0
PO Box 213 Somewhere IO 1234            0 Somewhere            IO                     0
1 BEACH DR SE # 2410 ST PETERSBURG F    1 BEACH DR SE # 2 St   PETERSBURG      FL 33701
# 123 12 BEACH DR SE ST PETERSBURG F   12 BEACH DR SE     St   PETERSBURG      FL 33701
46 Broad Street #12 Suburb CA 95472    46 Broad           St                          0

I have developed a Perl module that can identify many of these more difficult patterns: https://metacpan.org/release/Lingua-EN-AddressParse. It recognizes idioms such as "The Parade" and nth Street, sub-property addresses such as "46 Broad Street #12", and many more.

Upvotes: 0

Sinan Ünür

Reputation: 118128

Take a look at Geo::StreetAddress::US if these are U.S. addresses.

Even if they are not, the source of this module should give you an idea of what is involved in parsing free form street addresses.

Here is a script that handles the addresses you posted (updated; an earlier version combined the lot and number into one string):

#!/usr/bin/perl

use strict; use warnings;

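# Paragraph mode: each blank-line-separated block in DATA is one record.
local $/ = "";

my @addresses;

while ( my $address = <DATA> ) {
    chomp $address;
    $address =~ s/\s+/ /g;    # collapse newlines and repeated spaces
    my (%address, $rest);

    # Split on the LAST comma by reversing the string, splitting once,
    # and reversing the pieces back. Without a comma, $rest stays undef.
    ($address{town}, $rest) = map { scalar reverse }
                        split( / ?, ?/, reverse($address), 2 );

    {
        # $rest is undef for town-only records; the failed match then
        # leaves lot, number and street undefined.
        no warnings 'uninitialized';
        @address{qw(lot number street)} =
            $rest =~ /^(?:(Lot [0-9]+) )?(?:([0-9]+) )?(.+)\z/;
    }
    push @addresses, \%address;
}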

use Data::Dumper;
print Dumper \@addresses;

__DATA__
123 any street,
mytown

Lot 4 another road,
thattown

Lot 2 96 other road,
her town

yourtown

street,
town

Output:

$VAR1 = [
          {
            'lot' => undef,
            'number' => '123',
            'street' => 'any street',
            'town' => 'mytown'
          },
          {
            'lot' => 'Lot 4',
            'number' => undef,
            'street' => 'another road',
            'town' => 'thattown'
          },
          {
            'lot' => 'Lot 2',
            'number' => '96',
            'street' => 'other road',
            'town' => 'her town'
          },
          {
            'lot' => undef,
            'number' => undef,
            'street' => undef,
            'town' => 'yourtown'
          },
          {
            'lot' => undef,
            'number' => undef,
            'street' => 'street',
            'town' => 'town'
          }
        ];

Upvotes: 7

John La Rooy

Reputation: 304147

This should separate into 3 parts - how do you distinguish the address/street?

(Lot \d*)? ?([^,]*,)? ?(.*)

Here is the breakdown for your examples:

('', '123 any street,', 'mytown')
('Lot 4', 'another road,', 'thattown')
('Lot 2', '96 other road,', 'her town')
('', 'this ave,', 'this town')
('', '', 'yourtown')

If I understand correctly, this one separates the address/street as well:

(Lot \d*)? ?(\d*) ?([^,]*,)? ?(.*)

('', '123', 'any street,', 'mytown')
('Lot 4', '', 'another road,', 'thattown')
('Lot 2', '96', 'other road,', 'her town')
('', '', 'this ave,', 'this town')
('', '', '', 'yourtown')
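
The tuples above have the shape of Python output, so here is a minimal sketch of how the second pattern could be applied, assuming Python's `re` module. The helper name `split_address` and the `samples` list are made up for illustration; the input strings are the ones implied by the breakdown above.

```python
import re

# The second pattern from this answer, unchanged.
PATTERN = re.compile(r"(Lot \d*)? ?(\d*) ?([^,]*,)? ?(.*)")

samples = [
    "123 any street, mytown",
    "Lot 4 another road, thattown",
    "Lot 2 96 other road, her town",
    "this ave, this town",
    "yourtown",
]


def split_address(line):
    """Return (lot, number, street, town) as strings for one line."""
    # Every part of the pattern is optional, so match() always succeeds.
    # groups(default="") turns groups that did not participate into "".
    return PATTERN.match(line).groups(default="")


for line in samples:
    print(split_address(line))
```

Note that the street group keeps its trailing comma; calling `.rstrip(",")` on that element would remove it.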

Upvotes: 1

RJD22

Reputation: 10340

I can't match the last one, but for the first three you can use something like this:

if (preg_match('/(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)/m', $subject, $regs)) {
    // $regs[1] = lot number, $regs[2] = street number,
    // $regs[3] = street name, $regs[4] = town
    $result = $regs[1];
} else {
    $result = "";
}

This is the regex to test with:

(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)

You can use this in regexbuddy to test: link
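
To illustrate why the town-only line is not matched, here is a rough sketch that transcribes the same pattern to Python's `re` module (an assumption for illustration; the answer itself uses PHP's `preg_match`). The function name `parse` is hypothetical. Because the pattern requires a literal ", ", an input with no comma produces no match at all.

```python
import re

# Same pattern as above; the /m flag is dropped because the pattern
# contains no ^ or $ for it to affect.
PATTERN = re.compile(r"(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)")


def parse(line):
    """Return (lot, number, street, town), or None when nothing matches."""
    # search() is unanchored, like preg_match.
    m = PATTERN.search(line)
    return m.groups(default="") if m else None


for line in [
    "123 any street, mytown",
    "Lot 4 another road, thattown",
    "Lot 2 96 other road, her town",
    "yourtown",
]:
    print(line, "->", parse(line))
```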

Upvotes: 0

Hans W

Reputation: 3891

I'd suggest you don't try to do all of this in a single regexp as it will be hard to verify its correctness.

First, I'd split at the comma. Whatever comes after the comma is the $town, and if there is no comma, the whole string is the $town.

Then I'd check if there is any lot information and extract it from the string.

Then I'd look for street/avenue number and name.

Divide and conquer :)
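
The steps above can be sketched roughly as follows. This is only one possible reading, written in Python for illustration: the function name `parse_address`, the dictionary keys, and the exact patterns for the lot and street number are assumptions, not something specified in the question.

```python
import re


def parse_address(line):
    """Split one address line into lot, number, street and town."""
    # Step 1: split at the last comma. With no comma, rpartition()
    # returns ("", "", line), so the whole line becomes the town.
    head, _comma, tail = line.rpartition(",")
    town = tail.strip()
    rest = head.strip()

    # Step 2: pull off an optional leading "Lot <digits>".
    lot = None
    m = re.match(r"Lot\s+(\d+)\b\s*", rest)
    if m:
        lot = m.group(1)
        rest = rest[m.end():]

    # Step 3: pull off an optional leading street number;
    # whatever remains is treated as the street name.
    number = None
    m = re.match(r"(\d+)\s+", rest)
    if m:
        number = m.group(1)
        rest = rest[m.end():]

    return {"lot": lot, "number": number, "street": rest or None, "town": town}


print(parse_address("Lot 2 96 other road, her town"))
```

Each step can be tested on its own, which is the main point of splitting the work up instead of relying on one large pattern.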

Upvotes: 7
