catacarlo

Reputation: 33

Regex to extract (austrian) street housenumber/stairs/floor/door

I need to extract the house number, together with the optional stairs, floor and door parts, for all the different combinations used in Austria:

|               Street name               | housenumber | stairs | floor | door |
| --------------------------------------- | ----------- | ------ | ----- | ---- |
| Lilienstr. 12a                          | 12a         |        |       |      |
| Leibnizstraße 36/28/2                   | 36          | 28     |       | 2    |
| Prager Straße 14/3/1/4                  | 14          | 3      | 1     | 4    |
| Guentherstr. 43 B                       | 43 B        |        |       |      |
| Eberhard-Leibnitz Str. 1/7              | 1           |        |       | 7    |
| Schießstätte 7/7                        | 7           |        |       | 7    |

I've already found this question: Regex to extract (german) street number.

This works if no stair/floor/door is entered. Can you help?

^[ \-0-9a-zA-ZäöüÄÖÜß.]+?\s+(\d+(\s?[a-zA-Z])?)\s*(?:$|\(|[A-Z]{2})

Upvotes: 3

Views: 364

Answers (3)

The fourth bird

Reputation: 163287

Credit for the core of the pattern (optional capturing groups combined with a positive lookahead) goes to @JvdV, who suggested it in the comments.

As an alternative, you can get the group numbers / names in the same order as the columns of the table in the question, by capturing the digits of the stairs / floor / door and asserting how many "forward slash followed by digits" parts are directly to the right.

If the assertion fails, the pattern will try the next part as all the groups are optional.

^(?<address>(?<streetname>\h*\S.*?)\h*(?<housenumber>\d+\h*[A-Za-z]?))(?:/(?<stairs>\d+)(?=(?:/\d+){1,2}))?(?:/(?<floor>\d+)(?=(?:/\d+)))?(?:/(?<door>\d+))?$

Regex demo | Php demo

In parts

  • ^ Start of string
  • (?<address> Group address
    • (?<streetname> Group streetname
      • \h*\S.*? Match 0+ horizontal whitespace chars, a non-whitespace char to make sure the address is not empty, and then any char as few times as possible (non-greedy)
    • ) Close group streetname
    • \h* Match 0+ horizontal whitespace chars for the trailing spaces after the streetname
    • (?<housenumber> Group housenumber
      • \d+\h*[A-Za-z]? Match 1+ digits, 0+ horizontal whitespace chars and optional char a-zA-Z
    • ) Close group housenumber
  • ) Close group address
  • (?: Non capture group
    • /(?<stairs>\d+) Group stairs, match 1+ digits
    • (?=(?:/\d+){1,2}) Positive lookahead, assert that what is directly to the right is 1 or 2 repetitions of / followed by 1+ digits
  • )? Close group and make it optional
  • (?: Non capture group
    • /(?<floor>\d+) Group floor, match 1+ digits
    • (?=(?:/\d+)) Positive lookahead, assert that what is directly to the right is / followed by 1+ digits
  • )? Close group and make it optional
  • (?: Non capture group
    • /(?<door>\d+) Group door, match 1+ digits
  • )? Close group and make it optional
  • $ End of string
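
To see how the lookaheads route each number into its group, the optional tail of the pattern can be tried on its own. This is only a minimal sketch and not part of the linked demos: $tail and filled_groups() are hypothetical names, and the floor group uses \d+ as described in the list above.

```php
<?php
// Only the optional "/stairs/floor/door" tail of the pattern, anchored on both sides.
$tail = '~^(?:/(?<stairs>\d+)(?=(?:/\d+){1,2}))?'
      . '(?:/(?<floor>\d+)(?=/\d))?'
      . '(?:/(?<door>\d+))?$~';

// Hypothetical helper: returns only the named groups that captured something.
function filled_groups(string $re, string $suffix): array
{
    if (preg_match($re, $suffix, $m) !== 1) {
        return [];
    }
    return array_filter(
        $m,
        function ($value, $key) {
            // Keep named groups (string keys) that are not empty.
            return is_string($key) && $value !== '';
        },
        ARRAY_FILTER_USE_BOTH
    );
}

foreach (['/7', '/28/2', '/3/1/4'] as $suffix) {
    echo $suffix, ' => ', json_encode(filled_groups($tail, $suffix)), PHP_EOL;
}
```

With a single part, both lookaheads fail and the number falls through to door; with two parts, only the stairs lookahead succeeds; with three parts, every group is filled.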

Example code

$re = '~^(?<address>(?<streetname>\h*\S.*?)\h*(?<housenumber>\d+\h*[A-Za-z]?))(?:/(?<stairs>\d+)(?=(?:/\d+){1,2}))?(?:/(?<floor>\d+)(?=(?:/\d+)))?(?:/(?<door>\d+))?$~m';
$strings = [
    "Lilienstr. 12a",
    "Leibnizstraße 36/28/2",
    "Prager Straße 14/3/1/4",
    "Guentherstr. 43 B",
    "Eberhard-Leibnitz Str. 1/7",
    "Schießstätte 7/7"
];

foreach ($strings as $string) {
    preg_match_all($re, $string, $matches, PREG_SET_ORDER);
    $address = array_filter($matches[0], "is_string", ARRAY_FILTER_USE_KEY); // from php 5.6
    print_r($address);
}

Output

Array
(
    [address] => Lilienstr. 12a
    [streetname] => Lilienstr.
    [housenumber] => 12a
)
Array
(
    [address] => Leibnizstraße 36
    [streetname] => Leibnizstraße
    [housenumber] => 36
    [stairs] => 28
    [floor] => 
    [door] => 2
)
Array
(
    [address] => Prager Straße 14
    [streetname] => Prager Straße
    [housenumber] => 14
    [stairs] => 3
    [floor] => 1
    [door] => 4
)
Array
(
    [address] => Guentherstr. 43 B
    [streetname] => Guentherstr.
    [housenumber] => 43 B
)
Array
(
    [address] => Eberhard-Leibnitz Str. 1
    [streetname] => Eberhard-Leibnitz Str.
    [housenumber] => 1
    [stairs] => 
    [floor] => 
    [door] => 7
)
Array
(
    [address] => Schießstätte 7
    [streetname] => Schießstätte
    [housenumber] => 7
    [stairs] => 
    [floor] => 
    [door] => 7
)
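
As the output shows, groups that were skipped come back as empty strings and trailing groups that did not take part in the match are left out entirely. If a uniform shape is easier to work with, the result can be normalised so that every key is always present. A possible sketch, assuming a hypothetical helper parse_austrian_address(), a floor group of \d+, and the u flag for UTF-8 input:

```php
<?php
// Hypothetical helper: returns all five parts, using null for missing ones,
// or null as a whole when the line does not match the pattern.
function parse_austrian_address(string $line): ?array
{
    $re = '~^(?<streetname>\h*\S.*?)\h*(?<housenumber>\d+\h*[A-Za-z]?)'
        . '(?:/(?<stairs>\d+)(?=(?:/\d+){1,2}))?'
        . '(?:/(?<floor>\d+)(?=/\d))?'
        . '(?:/(?<door>\d+))?$~u';

    if (preg_match($re, $line, $m) !== 1) {
        return null;
    }

    $result = [];
    foreach (['streetname', 'housenumber', 'stairs', 'floor', 'door'] as $key) {
        // Missing trailing groups are absent, skipped groups are ''.
        $value = $m[$key] ?? '';
        $result[$key] = ($value === '') ? null : $value;
    }
    return $result;
}

var_export(parse_austrian_address('Leibnizstraße 36/28/2'));
```

Returning null for a non-matching line keeps "no address found" distinct from "address without stairs, floor or door".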

Upvotes: 2

Levi Cole

Reputation: 3684

I don't know Austrian address formats, so it's hard for me to say whether this is correct, but please see the regex below.

^(.*)\s+(\d+(?:\s*[a-zA-Z])?)(?:\/(\d+))?(?:\/(\d+))?(?:\/(\d+))?\s*(?:$|\(|[A-Z]{2})

This expression captures up to 4 number groups (1/2/3/4), filled from left to right regardless of what each number means, so you will need some additional processing to determine whether an address has a house number, stairs, floor and door, or only a house number and door.

For example:

<?php

$pattern = '^(.*)\s+(\d+(?:\s*[a-zA-Z])?)(?:\/(\d+))?(?:\/(\d+))?(?:\/(\d+))?\s*(?:$|\(|[A-Z]{2})';

$addresses = [
    'Lilienstr. 12a',
    'Leibnizstraße 36/28/2',
    'Prager Straße 14/3/1/4',
    'Guentherstr. 43 B',
    'Eberhard-Leibnitz Str. 1/7',
    'Schießstätte 7/7'
];

$results = [];

foreach ( $addresses as $address ) {
    
    // 0. Full match
    // 1. Streetname
    // 2. Housenumber
    // 3. Stairs
    // 4. Floor
    // 5. Door
    if ( ! preg_match( '/' . $pattern . '/', $address, $matches ) ) {
        continue; // Skip input that does not look like an address
    }

    // Remove full match from the results
    array_shift( $matches );
    
    // Set up default values
    $streetname = array_shift( $matches );
    $housenumber = null;
    $stairs = null;
    $floor = null;
    $door = null;

    // Count total values given (filter on length so a value of "0" is not dropped)
    $total = count( array_filter( array_map( 'trim', $matches ), 'strlen' ) );

    switch ( $total ) {

        // Has all 4 parts
        case 4:
            $housenumber = $matches[ 0 ];
            $stairs = $matches[ 1 ];
            $floor = $matches[ 2 ];
            $door = $matches[ 3 ];
            break;

        // Only has 3 parts
        case 3:
            $housenumber = $matches[ 0 ];
            $stairs = $matches[ 1 ];
            $door = $matches[ 2 ];
            break;

        // Only has 2 parts
        case 2:
            $housenumber = $matches[ 0 ];
            $door = $matches[ 1 ];
            break;

        // Has 1 part
        default:
            $housenumber = $matches[ 0 ];
            break;
    }

    // Add to results array
    $results[] = [
        'address' => $address,
        'streetname' => $streetname,
        'housenumber' => $housenumber,
        'stairs' => $stairs,
        'floor' => $floor,
        'door' => $door
    ];

}

print_r( $results );

Output

Array
(
    [0] => Array
        (
            [address] => Lilienstr. 12a
            [streetname] => Lilienstr.
            [housenumber] => 12a
            [stairs] => 
            [floor] => 
            [door] => 
        )

    [1] => Array
        (
            [address] => Leibnizstraße 36/28/2
            [streetname] => Leibnizstraße
            [housenumber] => 36
            [stairs] => 28
            [floor] => 
            [door] => 2
        )

    [2] => Array
        (
            [address] => Prager Straße 14/3/1/4
            [streetname] => Prager Straße
            [housenumber] => 14
            [stairs] => 3
            [floor] => 1
            [door] => 4
        )

    [3] => Array
        (
            [address] => Guentherstr. 43 B
            [streetname] => Guentherstr.
            [housenumber] => 43 B
            [stairs] => 
            [floor] => 
            [door] => 
        )

    [4] => Array
        (
            [address] => Eberhard-Leibnitz Str. 1/7
            [streetname] => Eberhard-Leibnitz Str.
            [housenumber] => 1
            [stairs] => 
            [floor] => 
            [door] => 7
        )

    [5] => Array
        (
            [address] => Schießstätte 7/7
            [streetname] => Schießstätte
            [housenumber] => 7
            [stairs] => 
            [floor] => 
            [door] => 7
        )

)

See here: http://sandbox.onlinephpfunctions.com/code/3952b2f3cab251e7137bcd9d55e42d8c8bcdd723
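
The switch above boils down to a lookup from the number of parts to the names of those parts. As a rough sketch of the same idea (assign_parts() and $layouts are hypothetical names, and the 2-part and 3-part layouts are the ones assumed from the table in the question):

```php
<?php
// Hypothetical helper: maps a list of 1 to 4 numbers onto named address parts.
function assign_parts(array $numbers): array
{
    // Which names the numbers get, keyed by how many numbers there are.
    $layouts = [
        1 => ['housenumber'],
        2 => ['housenumber', 'door'],
        3 => ['housenumber', 'stairs', 'door'],
        4 => ['housenumber', 'stairs', 'floor', 'door'],
    ];
    $parts = ['housenumber' => null, 'stairs' => null, 'floor' => null, 'door' => null];

    $numbers = array_values($numbers);
    $count = count($numbers);
    if (!isset($layouts[$count])) {
        return $parts; // Unknown layout: leave everything empty
    }

    // Pair names with values, then overlay them on the defaults.
    return array_merge($parts, array_combine($layouts[$count], $numbers));
}

print_r(assign_parts(explode('/', '36/28/2')));
```

Keeping the layouts in one array makes it easy to adjust the mapping if a different convention turns out to apply to 2-part or 3-part addresses.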

Upvotes: 1

marianc

Reputation: 449

Is this what you are looking for?

([a-zA-Z][ \-0-9a-zA-ZäöüÄÖÜß.\/]+\w)\s*\|\s+(\d+(?:\s?[a-zA-Z])?)\s*\|\s+(\d+)?\s*\|\s+(\d+)?\s*\|\s+(\d+)?

Please check the demo

Upvotes: 0
