user1667474

Reputation: 819

How to extract address components from a string?

I have a Xamarin Forms application that uses Xamarin.Mobile on the platforms to get the current location and then ascertain the current address. The address is returned as a string with line breaks.

The address can look like this:

111 Mandurah Tce
Mandurah WA 6210
Australia

or

The Glades
222 Mandurah Tce
Mandurah WA 6210
Australia

I have this code to break it down into the street address (including number), suburb, state and postcode (not very elegant, but it works):

// Split the address into its non-empty lines.
string[] lines = address.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
List<string> addyList = new List<string>(lines);
int count = addyList.Count;
// The second-to-last line holds "suburb state postcode".
string lineToSplit = addyList.ElementAt(count - 2);
string[] splitLine = lineToSplit.Split(null); // a null separator splits on whitespace
List<string> splitList = new List<string>(splitLine);

// The line before that is the street address (including the number).
string streetAddress = addyList.ElementAt(count - 3);
string postCode = splitList.ElementAt(2);
string state = splitList.ElementAt(1);
string suburb = splitList.ElementAt(0);

I would like to extract the street number. In the examples above this would be easy, but what is the best way to do it when the number might be Lot 111 (I only need to capture the 111, not the word Lot), 123A or 8/123? Sometimes something like 111-113 is also returned.

I know that I can use regex and look for every possible combo, but is there an elegant built-in type solution, before I go writing any more messy code (and I know that the above code isn't particularly robust)?

Upvotes: 0

Views: 5339

Answers (4)

Kim Ryan

Reputation: 515

These simple regular expressions will account for many types of address formats, but have you considered all the possible variations, such as:

PO Box 123 suburb state post_code
Unit, Apt, Flat, Villa, Shop X Y street name
7C/94 ALISON ROAD RANDWICK NSW 2031

And that is just to get the number. You will also have to deal with all the possible types of streets, such as Lane, Road, Place, Av and Parkway.

Then there are street types such as:

12 Grand Ridge Road suburb_name

This could be interpreted as street = "Grand Ridge" and suburb = "Road suburb_name", as Ridge is also a valid street type.

I have done a lot of work in this area and found that the huge number of valid address patterns meant simple regexes didn't solve the problem on large amounts of data.

I ended up developing this parser http://search.cpan.org/~kimryan/Lingua-EN-AddressParse-1.20/lib/Lingua/EN/AddressParse.pm to solve the problem. It was originally written for Australian addresses, so it should work well for you.

Upvotes: 1

Carter

Reputation: 744

Yeah, I think you have to identify what will work.

If:

  • it is always in the address line and it always starts with a digit
  • nothing else in that line can start with a digit (or, if something else does, you know which one always comes first, i.e. the code below will always work if the street number is always first)
  • you want every non-whitespace character contiguous with that digit (the - and / examples suggest that to me)

Then it could be as simple as:

var regx = new Regex(@"(?:\s|^)\d[^\s]*");
var mtch = regx.Match(addressline);

You would have to sift through your data and see whether any of those assumptions are broken.
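To see how that pattern behaves on the street lines from the question, here is a minimal sketch. The class and method names (`StreetNumberSketch`, `FirstDigitToken`) are made up for illustration. The match is trimmed because the pattern can include the whitespace character in front of the digit.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StreetNumberSketch
{
    // Same pattern as above: whitespace or start of string, then a digit,
    // then every following non-whitespace character.
    private static readonly Regex FirstDigitTokenPattern =
        new Regex(@"(?:\s|^)\d[^\s]*");

    // Returns the first token that starts with a digit, or "" if there is none.
    public static string FirstDigitToken(string addressLine)
    {
        Match match = FirstDigitTokenPattern.Match(addressLine);
        // Trim removes the leading whitespace the pattern may have captured.
        return match.Success ? match.Value.Trim() : string.Empty;
    }
}
```

With this, `FirstDigitToken("Lot 111 Mandurah Tce")` gives "111", while "8/123 Mandurah Tce" gives "8/123" and "111-113 Mandurah Tce" gives "111-113", so unit numbers and ranges still have to be separated afterwards if you only want the street number itself.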

Upvotes: 0

Olivier Jacot-Descombes

Reputation: 112334

Regex can capture the parts of a match into groups. Each pair of parentheses () defines a group.

([^\d]*)(\d*)(.*)

For "Lot 222 Mandurah Tce" this returns the following groups

Group 0: "Lot 222 Mandurah Tce" (the whole match, here the entire input string)
Group 1: "Lot "
Group 2: "222"
Group 3: " Mandurah Tce"

Explanation:

[^\d]* Any number (including 0) of any character except digits.
\d* Any number (including 0) of digits.
.* Any number (including 0) of any character.

string input = "Lot 222 Mandurah Tce";
Match match = Regex.Match(input, @"([^\d]*)(\d*)(.*)");
string beforeNumber = match.Groups[1].Value; // --> "Lot "
string number = match.Groups[2].Value;       // --> "222"
string afterNumber = match.Groups[3].Value;  // --> " Mandurah Tce"

If a group matches nothing, match.Groups[i].Value will be an empty string ("") for that group.
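The same grouping idea can be stretched to the other formats mentioned in the question (123A, 8/123, 111-113) by using named groups. This is only a sketch under assumptions: the class name `StreetNumberParts`, the method `Parse` and the group names `unit`, `number`, `suffix` and `rangeEnd` are invented for illustration, and the pattern does not try to cover PO boxes or other address styles (for "PO Box 123" it would simply report 123 as the number).

```csharp
using System;
using System.Text.RegularExpressions;

public static class StreetNumberParts
{
    // ^\D*?                      skip leading non-digits such as "Lot "
    // (?:(?<unit>...)/)?         optional unit before a slash, e.g. "8/" or "7C/"
    // (?<number>\d+)             the street number itself
    // (?<suffix>[A-Za-z])?       optional single-letter suffix, e.g. "A" in 123A
    // (?:-(?<rangeEnd>\d+))?     optional end of a range, e.g. "-113"
    // \b                         the token must end here
    private static readonly Regex Pattern = new Regex(
        @"^\D*?(?:(?<unit>\d+[A-Za-z]?)/)?(?<number>\d+)(?<suffix>[A-Za-z])?(?:-(?<rangeEnd>\d+))?\b");

    // Returns the Match; check Success, then read the named groups.
    // A named group that did not take part in the match has an empty Value.
    public static Match Parse(string streetLine)
    {
        return Pattern.Match(streetLine);
    }
}
```

For example, `StreetNumberParts.Parse("8/123 Mandurah Tce")` yields `unit` = "8" and `number` = "123", and `Parse("111-113 Mandurah Tce")` yields `number` = "111" and `rangeEnd` = "113". A line without any digits, such as "The Glades", produces a match whose `Success` is false.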

Upvotes: 1

kay00

Reputation: 447

You could split the street address line and check whether each entry starts with a number.

// addressLine is the street address line, e.g. "Lot 111 Mandurah Tce"
string[] splitLine = addressLine.Split(null);

var streetNumber = string.Empty;
foreach (var s in splitLine)
{
  // Take the first entry that starts with a digit
  if (Regex.IsMatch(s, @"^\d"))
  {
       streetNumber = s;
       break;
  }
}

// Deal with an empty value another way

Console.WriteLine("My streetnumber is " + streetNumber);
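The same "first entry that starts with a digit" idea can also be written without a regex. This is a sketch only; the class and method names (`FirstNumericToken`, `Find`) are hypothetical.

```csharp
using System;
using System.Linq;

public static class FirstNumericToken
{
    // Returns the first whitespace-separated token that starts with a digit,
    // or "" when the line contains no such token.
    public static string Find(string addressLine)
    {
        return addressLine
            // A null separator splits on whitespace; empty entries are dropped,
            // so token[0] is always safe to read.
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .FirstOrDefault(token => char.IsDigit(token[0])) ?? string.Empty;
    }
}
```

`FirstNumericToken.Find("Lot 111 Mandurah Tce")` gives "111", and a line such as "The Glades" gives an empty string, which the caller still has to handle.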

Upvotes: 0
