user9418

Reputation: 395

Break down address into array

I have a list of addresses that need to be broken down into an array.

I started by thinking of using explode to break each line into an array, which would work fine on an address like this:

Adwell - Oxfordshire 51.68N 01.00W SU6999

But if I had an address like this:

Afan - Castell-nedd Phort Talbot (Neath Port Talbot) 51.63N 03.74W SS794938

it would cause problems.

I've been trying to play around with preg_match but can't get an expression to work so that I end up with:

0 => Adwell 1 => Oxfordshire 2 => 51.68N 3 => 01.00W 4 => SU6999

The output for the second address should be:

0 => Afan 1 => Castell-nedd Phort Talbot (Neath Port Talbot) 2 => 51.63N 3 => 03.74W 4 => SS794938

Does anyone see a good way to achieve this with a regular expression?

Upvotes: 2

Views: 301

Answers (5)

Shiplu Mokaddim

Reputation: 57650

I don't think you need a regex for that. A simple explode call is enough.

explode(' ', "Adwell - Oxfordshire 51.68N 01.00W SU6999")

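A more advanced way:

$str = "Afan - Castell-nedd Phort Talbot (Neath Port Talbot) 51.63N 03.74W SS794938";
// Split on spaces; array_filter() drops the empty tokens that repeated spaces produce.
$parts = array_filter(explode(' ', $str));
// Peel the three fixed fields off the end, right to left.
$ss = array_pop($parts);
$w = array_pop($parts);
$n = array_pop($parts);
// The first token is the place name; the next one is the "-" separator.
$name = array_shift($parts);
$hash = array_shift($parts);
// Whatever is left is the region. On PHP 8+ the separator must be implode()'s first argument.
$result = array($name, implode(' ', $parts), $n, $w, $ss);
print_r($result);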

Upvotes: 1

Susam Pal

Reputation: 34204

<?php
// Solution.
function parseAddress($address)
{
    $matches = NULL; 
    preg_match('/([^-]*) - ([^\d]*) (\d\d\.\d\dN) (\d\d\.\d\dW) (.*)/',
               $address, $matches);
    return array_slice($matches, 1);
}

// Test case 1.
$parsed = parseAddress('Adwell - Oxfordshire 51.68N 01.00W SU6999');
var_dump($parsed);

// Test case 2.
$parsed = parseAddress('Afan - Castell-nedd Phort Talbot (Neath Port Talbot) ' .
                       '51.63N 03.74W SS794938');
var_dump($parsed);
?>

Output:

array(5) {
  [0]=>
  string(6) "Adwell"
  [1]=>
  string(11) "Oxfordshire"
  [2]=>
  string(6) "51.68N"
  [3]=>
  string(6) "01.00W"
  [4]=>
  string(6) "SU6999"
}
array(5) {
  [0]=>
  string(4) "Afan"
  [1]=>
  string(45) "Castell-nedd Phort Talbot (Neath Port Talbot)"
  [2]=>
  string(6) "51.63N"
  [3]=>
  string(6) "03.74W"
  [4]=>
  string(8) "SS794938"
}

Upvotes: 2

EmmanuelG

Reputation: 1051

I have been working on address parsing and the like for quite some time, and unfortunately there is no solution that covers all of your bases. What you need to determine is what all the addresses have in common. To me that seems to be the fields on the right, so I would parse those out first: explode by space and grab the last three items (three pops or a slice both work). Then recombine (join) the remainder and run a regex on it:

/([a-z]+)\s-\s([a-z\-)\s\(\)]+)/i

This would give you two batches of strings: the first is the part before the separator and the second is whatever remains. You would then need to check whether there is anything in parentheses and parse that out accordingly.

Unfortunately I am not completely familiar with your address format, as I deal mostly with US-based address strings/blocks. However, after you remove the common items from the end, the remaining string should have its city/state/province parts easily identifiable. Either way, you need a gauntlet of regexes and logic to ensure that the end result is as accurate as possible. Essentially you design a path for the data to take as it comes in, based on its format.
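The steps described above could look roughly like this. It is only a sketch: the function name splitAddress is made up for illustration, and it assumes every line ends with exactly three space-separated fields (latitude, longitude, grid reference), which is true of the two sample lines but not confirmed for the whole list.

```php
<?php
// Hypothetical helper: peel the three fixed fields off the right, then regex the rest.
function splitAddress(string $line): ?array
{
    // Split on runs of whitespace so doubled spaces do not create empty tokens.
    $tokens = preg_split('/\s+/', trim($line));
    if (count($tokens) < 6) {
        return null; // need a name, the "-", a region, and three trailing fields
    }

    // Assumption: the last three tokens are latitude, longitude and grid reference.
    $tail = array_splice($tokens, -3);

    // Recombine what is left and split it around " - " with the pattern from this answer.
    $head = implode(' ', $tokens);
    if (!preg_match('/([a-z]+)\s-\s([a-z\-)\s\(\)]+)/i', $head, $m)) {
        return null;
    }

    return array_merge([$m[1], $m[2]], $tail);
}

print_r(splitAddress('Afan - Castell-nedd Phort Talbot (Neath Port Talbot) 51.63N 03.74W SS794938'));
```

Because the pattern is unanchored and [a-z]+ does not allow spaces, a place name made of several words would keep only its last word. That limitation is carried over unchanged from the pattern.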

Good Luck!

Upvotes: 0

tdammers

Reputation: 20721

You need to disambiguate your syntax better. From these two examples, my guess would be that the following should work:

  • split into two components, using ' - ' as the separator. The first component can be kept as is, the rest needs further processing.
  • from the rest, take the last 3 space-delimited parts, and keep the rest as-is.

So try this one:

/^(.*?)\s-\s(.*)\s+(\S+)\s+(\S+)\s+(\S+)$/
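For reference, a minimal way to try that pattern from PHP. The wrapper name parseLine is hypothetical, and returning null for a line that does not match is just one possible choice.

```php
<?php
// Hypothetical wrapper around the pattern above; returns null when a line does not match.
function parseLine(string $line): ?array
{
    // Lazy first group stops at the first " - "; the greedy second group
    // leaves exactly three whitespace-delimited tokens for the end.
    $pattern = '/^(.*?)\s-\s(.*)\s+(\S+)\s+(\S+)\s+(\S+)$/';
    if (preg_match($pattern, trim($line), $m) !== 1) {
        return null;
    }
    return array_slice($m, 1); // drop the full-match entry at index 0
}

var_dump(parseLine('Afan - Castell-nedd Phort Talbot (Neath Port Talbot) 51.63N 03.74W SS794938'));
```

The hyphen in "Castell-nedd" is not treated as a separator, because the pattern requires whitespace on both sides of the dash.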

Without a more formal description of the expected input format, nobody will be able to give you a decisive answer though.

Upvotes: 1

Lee

Reputation: 10603

(.*)\s+-\s*(.*)\s+(\d+\.\d+N)\s*(\d+\.\d+W)\s*([A-Z]{2}\d+)

Probably the most flexible. I've made most of the whitespace optional, except where you see \s+, which acts as a sort of delimiter for the free text.
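A quick sketch of what the optional whitespace buys you. Assumption: the grid reference is two uppercase letters followed by digits, written here as [A-Z]{2}\d+. The second input line is invented to show the \s* parts matching nothing.

```php
<?php
// Assumption: the grid reference is two uppercase letters followed by digits.
$pattern = '/(.*)\s+-\s*(.*)\s+(\d+\.\d+N)\s*(\d+\.\d+W)\s*([A-Z]{2}\d+)/';

$lines = [
    'Adwell - Oxfordshire 51.68N 01.00W SU6999',
    'Adwell -Oxfordshire 51.68N01.00W SU6999', // invented: whitespace dropped where \s* allows it
];

$parsed = [];
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $m)) {
        $parsed[] = array_slice($m, 1); // keep only the five captured fields
    }
}
print_r($parsed);
```

Both lines should come out as the same five fields. The whitespace before the dash and before the latitude stays mandatory, since those \s+ are what separate the free text from the fixed fields.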

Upvotes: 0
