Reputation: 25
I need help with a script that crawls websites for mailing addresses, mostly German ones. I have already written a script that extracts email addresses from these websites, but I am unsure how to approach postal addresses because they have no fixed format. Here are a few German addresses as examples of the data I want to extract:
Ilona Mustermann
Hauptstr. 76
27852 Musterheim
Andreas Mustermann
Schwarzwaldhochstraße 1
27812 Musterhausen
D. Mustermann
Kaiser-Wilhelm-Str.3
27852 Mustach
Those are just a few examples of what I am looking to extract from the websites. Is this possible to do with PHP?
Edit:
This is what I have so far
function extract_address($str) {
    $str = strip_tags($str);
    $Name = null;
    $zcC = null;
    $Street = null;
    foreach(preg_split('/([^A-Za-z0-9üß\-\@\.\(\) .])+/', $str) as $token) {
        if(preg_match('/([A-Za-z\.])+ ([A-Za-z\.])+/', $token)){
            $Name = $token;
        }
        if(preg_match('/ /', $token)){
            $Street = $token;
        }
        if(preg_match('/[0-9]{5} [A-Za-zü]+/', $token)){
            $zcC = $token;
        }
        if(isset($Name) && isset($zcC) && isset($Street)){
            echo($Name."<br />".$Street."<br />".$zcC."<br /><br />");
            $Name = null;
            $Street = null;
            $zcC = null;
        }
    }
}
It retrieves $Name (e.g. Ilona Mustermann) and the zip code/city (e.g. 27852 Musterheim), but I am unsure what regex would reliably retrieve streets.
This is what I have come up with so far. Name and zip/city are extracted 100% of the time, but street extraction only works about 60% of the time. Any idea why it occasionally fails?
function extract_address($str) {
    $str = strip_tags($str);
    $Name = null;
    $zcC = null;
    $Street = null;
    foreach(preg_split('/([^A-Za-z0-9üß\-\@\.\(\)\& .])+/', $str) as $token) {
        if(preg_match('/([A-Za-z\&.])+ ([A-Za-z.])+/', $token) && !preg_match('/([A-Za-zß])+ ([0-9])+/', $token)){
            //echo("N:$token<br />");
            $Name = $token;
        }
        if(preg_match('/(\.)+/', $token) || preg_match('/(ß)+/', $token) || preg_match('/([A-Za-zß\.])+ ([0-9])+/', $token)){
            $Street = $token;
        }
        if(preg_match('/([0-9]){5} [A-Za-züß]+/', $token)){
            $zcC = $token;
        }
        /*echo("<br />
        N:$Name
        <br />
        S:$Street
        <br />
        Z:$zcC
        <br />
        ");*/
        if(isset($Name) && isset($zcC) && isset($Street)){
            echo($Name."<br />".$Street."<br />".$zcC."<br /><br />");
            $Name = null;
            $Street = null;
            $zcC = null;
        }
    }
}
Upvotes: 0
Views: 394
Reputation: 5105
Vlad Bondarenko is right.
In CS speak: Postal addresses do not form a regular language.
Information extraction is an active research topic. Regular expressions are not completely bogus, but they will have a higher failure rate than approaches that use dictionaries ("gazetteers") or more advanced machine learning algorithms.
A good Stack Overflow Q&A on this is "How to parse freeform street/postal address out of text, and into components".
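To make the dictionary idea concrete, here is a minimal sketch rather than a tested solution. It assumes each address part sits on its own line (as in the question's samples), anchors on the postal code line because that is the most regular part, and accepts the line above it as a street only if it ends in a known street-name suffix. The `STREET_SUFFIXES` list and both function names are hypothetical, and a real gazetteer would be far larger.

```php
<?php
// Hypothetical, deliberately tiny "gazetteer" of common German street-name endings.
const STREET_SUFFIXES = ['straße', 'strasse', 'str.', 'weg', 'platz', 'allee', 'gasse', 'ring'];

// True if the line ends in a known suffix followed by a house number, e.g. "Hauptstr. 76".
function looks_like_street(string $line): bool
{
    $quoted = array_map(function ($s) { return preg_quote($s, '/'); }, STREET_SUFFIXES);
    // Suffix, optional space, house number, optional letter ("12a"); case-insensitive, UTF-8 aware.
    $pattern = '/(?:' . implode('|', $quoted) . ') ?\d+ ?[a-z]?$/iu';
    return preg_match($pattern, trim($line)) === 1;
}

// Returns a list of ['name' => ..., 'street' => ..., 'city' => ...] arrays.
function extract_addresses(string $text): array
{
    // Split into trimmed, non-empty lines.
    $lines = array_values(array_filter(array_map('trim', preg_split('/\R/u', $text)), 'strlen'));
    $found = [];
    foreach ($lines as $i => $line) {
        // Anchor on "five digits + city", then check the line above against the gazetteer.
        if ($i >= 2
            && preg_match('/^\d{5}\s+\p{L}[\p{L} .-]*$/u', $line)
            && looks_like_street($lines[$i - 1])) {
            $found[] = ['name' => $lines[$i - 2], 'street' => $lines[$i - 1], 'city' => $line];
        }
    }
    return $found;
}

$sample = "Ilona Mustermann\nHauptstr. 76\n27852 Musterheim\n"
        . "Andreas Mustermann\nSchwarzwaldhochstraße 1\n27812 Musterhausen\n"
        . "D. Mustermann\nKaiser-Wilhelm-Str.3\n27852 Mustach\n";
$addresses = extract_addresses($sample);
```

Streets whose names are not in the suffix list (for example "Am Markt 5") are silently skipped, which is exactly the trade-off of a dictionary approach: precision depends on how complete the dictionary is.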
Upvotes: 0
Reputation: 795
It's impossible to get reliable results from a regex on input this irregular. That's the only correct answer to this question.
Upvotes: 1
Reputation: 20286
Of course it is possible; you need to use the preg_match() function. It is all about making a good regex pattern.
For example, to get the postal code and city:
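<?php
$str = "YOUR ADDRESSES STRING HERE";
// German postal codes are exactly five digits, followed by a space and the city name.
// The /u flag makes the pattern UTF-8 aware so umlauts and ß are handled correctly.
preg_match('/\b([0-9]{5}) ([A-Za-zÄÖÜäöüß.-]+)/u', $str, $matches);
print_r($matches);
?>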
This regex matches the addresses you've given; you also need to add your own native characters to it:
[A-Za-züß.]+ [A-Za-z.üß]+\s[A-Za-z. 0-9ß-]+\s[0-9]+ [A-Za-züß.]+
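A pattern along these lines could be applied to a whole page with preg_match_all(), roughly as sketched here. The named groups, the /u flag and the use of `\p{L}` in place of a hand-written list of native characters are my assumptions, not part of the pattern above, and the sketch still expects name, street and zip/city on consecutive lines.

```php
<?php
// Assumed variant of the pattern: Unicode-aware, one named group per address part.
$pattern = '/(?<name>\p{L}[\p{L}.]*(?: \p{L}[\p{L}.-]*)+)\R'   // e.g. "D. Mustermann"
         . '(?<street>\p{L}[\p{L}. -]*? ?\d+ ?[a-zA-Z]?)\R'     // e.g. "Kaiser-Wilhelm-Str.3"
         . '(?<zip>\d{5}) (?<city>\p{L}[\p{L} .-]*)/u';         // e.g. "27852 Mustach"

$text = "Ilona Mustermann\nHauptstr. 76\n27852 Musterheim\n"
      . "D. Mustermann\nKaiser-Wilhelm-Str.3\n27852 Mustach\n";

// PREG_SET_ORDER yields one array per matched address instead of one per group.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m['name'], ' | ', $m['street'], ' | ', $m['zip'], ' ', $m['city'], "\n";
}
```

Any address laid out differently (company names, extra lines, the street on the same line as the city) will not match, so treat the result as candidates to review, not as reliable extractions.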
Upvotes: 1