Dominic Jonas

Reputation: 5005

Regex to extract (German) street number

I have the following street constellations:

|               Street name               | extracted value |
| --------------------------------------- | --------------- |
| Lilienstr. 12a                          | 12a             |
| Hagentorwall 3                          | 3               |
| Seilerstr. 14 (Eingang Birkenstr.)      | 14              |
| Guentherstr. 43 B                       | 43 B            |
| Eberhard-Leibnitz Str. 1 WH 5B 241      | 1               |
| 1019-1781 Borderlinx C/O SEKO Logistics |        -        |

My regex partially works (https://regex101.com/r/KumamP/2):

\d+(?:[a-zA-Z]$|\s[a-zA-Z]$)?

Does anyone have a better solution? Eberhard-Leibnitz Str. 1 WH 5B 241 should give me only one result (or none), and 1019-1781 Borderlinx C/O SEKO Logistics should give me no result at all.
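To make the problem reproducible outside regex101, here is a minimal Python sketch that runs the regex above against two of the sample lines. The helper name `all_matches` is only illustrative, and Python's `re` module is assumed to behave like the regex101 flavor for this pattern.

```python
import re

# The regex from the question, unchanged.
QUESTION_PATTERN = re.compile(r"\d+(?:[a-zA-Z]$|\s[a-zA-Z]$)?")

def all_matches(line):
    """Return every non-overlapping match in the line."""
    return QUESTION_PATTERN.findall(line)

# Several results where only one (or none) is wanted:
print(all_matches("Eberhard-Leibnitz Str. 1 WH 5B 241"))
# Results where none are wanted:
print(all_matches("1019-1781 Borderlinx C/O SEKO Logistics"))
```

Because the pattern is not anchored to the street name, every run of digits in the line counts as a match.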

Upvotes: 2

Views: 2450

Answers (2)

Matias Albarello

Reputation: 1

Parsing address lines is not trivial. Many countries have their own special rules, and Germany and Austria are especially tricky.

One of the examples you provided illustrates the problem particularly well:

"Eberhard-Leibnitz Str. 1 WH 5B 241"

Here "WH" stands for "Wohnung" (apartment), but usually just "W" is used, together with a separator such as "//". So it would more typically read: "Eberhard-Leibnitz Str. 1 // W 5B 241"

It is also common to find "co", "c/o", or "z. H" (short for "zu Händen von", i.e. "for the attention of"). Anything that follows is just the name on the mailbox.

Last but not least, the address line may also contain the zip code and city name. That depends on the API you're interacting with, or on whether it's user input (which can get very wild!).

So, to parse address lines properly, you should first normalize them by removing that extra information. Then you can use a regex. Take a look at this gem: https://github.com/matiasalbarello/address_line_divider
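The normalize-then-match idea can be sketched roughly as follows. This is only an illustration in Python, not how the gem works internally: the cleanup rules in `NOISE_PATTERNS`, the helper names, and the assumption that the house number is the last token after cleanup are all hypothetical. It does not handle a trailing zip code and city.

```python
import re

# Hypothetical cleanup rules (assumptions, not taken from the gem).
NOISE_PATTERNS = [
    r"\(.*?\)",                            # notes such as "(Eingang Birkenstr.)"
    r"//.*$",                              # everything after a "//" separator
    r"(?i:\b(?:c/o|co|z\.\s?H\.?)\s.*$)",  # "c/o", "co", "z. H" and what follows
    r"\bWH?\s+\d.*$",                      # apartment info such as "WH 5B 241"
]

# After cleanup, assume the house number is the last token of the line.
HOUSE_NUMBER = re.compile(r"\s(\d+(?:\s?[a-zA-Z])?)$")

def normalize(line):
    """Strip the extra information, then collapse whitespace."""
    for pattern in NOISE_PATTERNS:
        line = re.sub(pattern, "", line)
    return re.sub(r"\s+", " ", line).strip()

def extract_house_number(line):
    """Return the house number of a normalized line, or None."""
    match = HOUSE_NUMBER.search(normalize(line))
    return match.group(1) if match else None

for sample in [
    "Seilerstr. 14 (Eingang Birkenstr.)",
    "Eberhard-Leibnitz Str. 1 // W 5B 241",
    "1019-1781 Borderlinx C/O SEKO Logistics",
]:
    print(sample, "->", extract_house_number(sample))
```

Keeping the cleanup rules in a list makes it easy to add further country-specific patterns without touching the final regex.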

Some good reads about the topic:

Upvotes: 0

splash

Reputation: 13327

The following regex works for your examples:

^[ \-a-zA-Z.]+\s+(\d+(\s?\w$)?)

https://regex101.com/r/KumamP/4
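As a quick check outside regex101, the regex can be run against the sample lines from the question with Python's `re` module. This is a small sketch; the helper name `house_number` is only illustrative, and the house number is read from capture group 1.

```python
import re

# First regex from this answer, unchanged.
FIRST_PATTERN = re.compile(r"^[ \-a-zA-Z.]+\s+(\d+(\s?\w$)?)")

def house_number(line):
    """Return capture group 1 (the house number) or None if nothing matches."""
    match = FIRST_PATTERN.match(line)
    return match.group(1) if match else None

samples = [
    "Lilienstr. 12a",
    "Hagentorwall 3",
    "Seilerstr. 14 (Eingang Birkenstr.)",
    "Guentherstr. 43 B",
    "Eberhard-Leibnitz Str. 1 WH 5B 241",
    "1019-1781 Borderlinx C/O SEKO Logistics",
]
for sample in samples:
    print(f"{sample!r} -> {house_number(sample)!r}")
```

The last sample yields `None` because the character class before the number does not allow digits, so a line that starts with a number cannot match.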

The basic assumption (as your samples suggest) is that valid "street constellations" always start with a street name followed by the street/house number.

The next regex also works for an entry like Straße des 17. Juni 1:

^[ \-0-9a-zA-ZäöüÄÖÜß.]+?\s+(\d+(\s?[a-zA-Z])?)\s*(?:$|\(|[A-Z]{2})

https://regex101.com/r/KumamP/5
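The second regex can be checked the same way. The sketch below again uses Python's `re` module and an illustrative helper name (`house_number_v2`). The street name part is matched lazily, so the first number that is followed by the end of the line, an opening parenthesis, or two capital letters is taken as the house number.

```python
import re

# Second regex from this answer, unchanged.
SECOND_PATTERN = re.compile(
    r"^[ \-0-9a-zA-ZäöüÄÖÜß.]+?\s+(\d+(\s?[a-zA-Z])?)\s*(?:$|\(|[A-Z]{2})"
)

def house_number_v2(line):
    """Return capture group 1 (the house number) or None if nothing matches."""
    match = SECOND_PATTERN.match(line)
    return match.group(1) if match else None

samples = [
    "Straße des 17. Juni 1",
    "Guentherstr. 43 B",
    "Eberhard-Leibnitz Str. 1 WH 5B 241",
    "1019-1781 Borderlinx C/O SEKO Logistics",
]
for sample in samples:
    print(f"{sample!r} -> {house_number_v2(sample)!r}")
```

In Straße des 17. Juni 1 the "17" is skipped because it is followed by a period rather than by one of the allowed terminators.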

But as the commenters already wrote, it is difficult for a regular expression to distinguish between numeric parts of a street name and the street number, even more so if you allow "unspecified" suffixes like (Eingang Birkenstr.) or WH 5B 241, as in your examples.

Upvotes: 4
