Richard

Reputation: 25

PHP, extracting mailing address

I am trying to write a script that crawls websites for mailing addresses, mostly German ones, but I am unsure how to approach it. I have already written a script that extracts email addresses from those websites, but postal addresses are puzzling because there is no fixed format. Here are a few German addresses as examples of the data I want to extract:

Ilona Mustermann
Hauptstr. 76
27852 Musterheim


Andreas Mustermann
Schwarzwaldhochstraße 1
27812 Musterhausen


D. Mustermann
Kaiser-Wilhelm-Str.3
27852 Mustach

Those are just a few examples of what I want to extract from the websites. Is this possible with PHP?

Edit:

This is what I have so far

function extract_address($str) {
    $str = strip_tags($str);
    $Name = null;
    $zcC = null;
    $Street = null;

    foreach(preg_split('/([^A-Za-z0-9üß\-\@\.\(\) .])+/', $str) as $token) {
        if(preg_match('/([A-Za-z\.])+ ([A-Za-z\.])+/', $token)){
            $Name = $token;
        }

        if(preg_match('/ /', $token)){
            $Street = $token;
        }

        if(preg_match('/[0-9]{5} [A-Za-zü]+/', $token)){
            $zcC = $token;
        }

        if(isset($Name) && isset($zcC) && isset($Street)){
            echo($Name."<br />".$Street."<br />".$zcC."<br /><br />");
            $Name = null;
            $Street = null;
            $zcC = null;
        }
    }
}

It works for retrieving $Name (e.g. Ilona Mustermann) and the zip code/city (27852 Musterheim), but I am unsure what regex would reliably retrieve streets.


This is what I have come up with so far. It seems to work about 60% of the time on streets; zip/city works 100% of the time, and so does the name. But the street extraction occasionally fails. Any idea why?

function extract_address($str) {
    $str = strip_tags($str);
    $Name = null;
    $zcC = null;
    $Street = null;

    foreach(preg_split('/([^A-Za-z0-9üß\-\@\.\(\)\& .])+/', $str) as $token) {
        if(preg_match('/([A-Za-z\&.])+ ([A-Za-z.])+/', $token) && !preg_match('/([A-Za-zß])+ ([0-9])+/', $token)){
            //echo("N:$token<br />");
            $Name = $token;
        }

        if(preg_match('/(\.)+/', $token) || preg_match('/(ß)+/', $token) || preg_match('/([A-Za-zß\.])+ ([0-9])+/', $token)){
            $Street = $token;
        }

        if(preg_match('/([0-9]){5} [A-Za-züß]+/', $token)){
            $zcC = $token;
        }

        /*echo("<br />
            N:$Name
            <br />
            S:$Street
            <br />
            Z:$zcC
            <br />
            ");*/

        if(isset($Name) && isset($zcC) && isset($Street)){
            echo($Name."<br />".$Street."<br />".$zcC."<br /><br />");
            $Name = null;
            $Street = null;
            $zcC = null;
        }
    }
}

Upvotes: 0

Views: 394

Answers (3)

mvw

Reputation: 5105

Vlad Bondarenko is right.

In CS speak: Postal addresses do not form a regular language.

Information extraction is an active research topic. Regular expressions are not completely useless here, but they will have a higher failure rate than approaches that use dictionaries ("gazetteers") or more advanced machine learning algorithms.

A nice Stack Overflow Q&A on this is "How to parse freeform street/postal address out of text, and into components".
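To make the dictionary idea a bit more concrete, here is a minimal sketch, not a real gazetteer. It anchors on the most regular part of a German address (five digits plus a city name) and then checks the line above it against a small word list of street types. The list STREET_WORDS and the function names looks_like_street() and extract_addresses_by_anchor() are made up for this illustration, and the list is far from complete.

```php
<?php
// Hypothetical, deliberately tiny word list of German street types.
const STREET_WORDS = ['straße', 'strasse', 'str.', 'weg', 'platz', 'allee', 'gasse'];

// A line counts as a street if it contains a digit (house number)
// and at least one of the known street words, ignoring case.
function looks_like_street(string $line): bool
{
    $alternatives = implode('|', array_map(
        function ($word) { return preg_quote($word, '/'); },
        STREET_WORDS
    ));

    return preg_match('/\d/', $line) === 1
        && preg_match('/(' . $alternatives . ')/iu', $line) === 1;
}

function extract_addresses_by_anchor(string $text): array
{
    // Split into trimmed, non-empty lines.
    $lines = [];
    foreach (preg_split('/\R/u', $text) as $raw) {
        $raw = trim($raw);
        if ($raw !== '') {
            $lines[] = $raw;
        }
    }

    $found = [];
    foreach ($lines as $i => $line) {
        // Anchor: a line of the form "12345 Cityname".
        $isZipLine = preg_match('/^\d{5} \p{L}[\p{L}.\- ]*$/u', $line) === 1;
        if ($i >= 2 && $isZipLine && looks_like_street($lines[$i - 1])) {
            $found[] = [
                'name'   => $lines[$i - 2],   // assumed: name is two lines above
                'street' => $lines[$i - 1],
                'city'   => $line,
            ];
        }
    }
    return $found;
}
```

The assumption that the name always sits directly above the street is of course fragile on real pages; a larger street and city dictionary would reduce false positives, but it would not remove them.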

Upvotes: 0

Vlad

Reputation: 795

It's impossible to get reliable results with a regex on input as loosely structured as this. That's the only correct answer to this question.

Upvotes: 1

Robert

Reputation: 20286

Of course it is possible; you need to use the preg_match() function. It is all about writing a good regex pattern.

For example, to get the postcode and city:

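<?php
$str = "YOUR ADDRESS STRING HERE";
// A German postcode is five digits; capture it together with the city name.
// $matches[1] holds the postcode, $matches[2] the city.
preg_match('/([0-9]{5}) ([A-Za-z]+)/', $str, $matches);
print_r($matches);
?>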

This regex matches the addresses you've given; you also need to add your native characters to it.

 [A-Za-züß.]+ [A-Za-z.üß]+\s[A-Za-z. 0-9ß-]+\s[0-9]+ [A-Za-züß.]+
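As a rough sketch of how a pattern along these lines could be applied to a whole block of text, the variant below uses preg_match_all() with named groups. It is not the exact pattern above: it swaps the hand-listed letters for \p{L} (any Unicode letter) and adds the "u" modifier, because without it PCRE works byte by byte and multi-byte characters such as ü and ß inside a character class are not treated as single letters. The function name find_addresses() and the group names are assumptions made for this example, and it only handles the three-line layout from the question.

```php
<?php
// Sketch only: expects "name / street + number / postcode + city" on
// three consecutive lines, as in the examples from the question.
function find_addresses(string $text): array
{
    $pattern = '/^(?<name>[\p{L}.\- ]+)\R'                        // name line
             . '(?<street>[\p{L}.\- ]+?\h*\d+\h*[a-zA-Z]?)\R'     // street + house number
             . '(?<zip>\d{5}) (?<city>[\p{L}.\- ]+)(?=\R|\z)/mu'; // postcode + city

    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        $result[] = [
            'name'   => trim($m['name']),
            'street' => trim($m['street']),
            'zip'    => $m['zip'],
            'city'   => trim($m['city']),
        ];
    }
    return $result;
}
```

Run on text extracted with strip_tags(), this only finds addresses that survive as three clean lines; anything laid out differently (a table, a single comma-separated line, an extra company line) is silently missed, which is the failure mode the other answers warn about.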

Upvotes: 1
