tRx

Reputation: 813

Regex optional groups and digit length

Maybe some regex-Master can solve my problem.

I have a big list of addresses with no separators (, ;). Each address string contains the following information:

regex_png

As you can see in the image above, the last two test strings do not match. I need the last two regex groups to be optional, and the third group should be either 4 or 5 digits.

I tried (\d{4,5}) to allow 4 or 5 digits, but this only works halfway, as you can see here: https://regex101.com/r/ZurqHh/1
regex_4_5_digits (This sometimes mixes the street number and zipcode together)
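
A minimal Python sketch that reproduces the mixing, assuming the current regex shown further down with only (\d{5}) changed to (\d{4,5}). The sample address is made up for illustration, and the i flag is mapped to re.IGNORECASE.

```python
import re

# The question's current regex with (\d{5}) changed to (\d{4,5}).
# The "i" flag from the question is mapped to re.IGNORECASE.
PATTERN = re.compile(
    r"^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{4,5})\s*(.+)?$",
    re.IGNORECASE,
)

# Hypothetical sample address, not taken from the original list.
sample = "Hauptstraße 12 10115 Berlin"

# Group 2 ([\d\s]+ ...) is tried first and is greedy, so it swallows
# "12 10115 " and then gives characters back one at a time until
# (\d{4,5}) can match. That first happens when only "0115" is left over.
groups = PATTERN.match(sample).groups()
print(groups)
```

With this input, group 2 comes out as "12 1" and group 3 as "0115", so the first digit of the zipcode ends up in the street number.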

I also tried (?:\d{5})? to make the third and fourth group optional. But this destroys my whole group layout... https://regex101.com/r/EgxeMy/1

regex_optional

This is my current regex:

/^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/im

Try it out yourself: https://regex101.com/r/zC8NCP/1

My brain is only farting at this moment and I can't think straight anymore.

Please help me fix this problem so I can die in peace.

Upvotes: 2

Views: 404

Answers (2)

Casimir et Hippolyte

Reputation: 89547

Addresses are difficult to parse because they sit halfway between formatted text and natural language. Here is a pattern that tries to keep the number of optional parts as small as possible while still succeeding on the examples given, without asking too much of the regex engine. To do this, I mainly rely on character classes, atomic groups, and a relatively accurate description of the street names. Obviously, the examples in the question cannot be representative of every real address, and characters may need to be added to or removed from the classes to deal with new cases. Nevertheless, the structure of this pattern is a good starting point.

~
^
(?<strasse> [\pL\d-]+ \.? (?> \h+ [\pL\d-]+ \.? )*? ) \h*
(?<nummer> \b (?> \d+ | [-+/\h]+ | [a-z] \b )*? )
(?: \h+ (?<plz> \d{4,5} )
    \h+ (?<stadt> .+ ) )?
$
~mxui

demo

Note that in the above link you can also see a previous version of this pattern with a more accurate description of the street number (a bit more efficient but longer).

Upvotes: 0

Wiktor Stribiżew

Reputation: 626738

You can use

^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$

See the regex demo (note that in the demo all \s are replaced with \h so that only horizontal whitespace is matched).

Details:

  • ^ - start of string
  • (.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
  • (?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))? - an optional non-capturing group matching
    • \s+ - one or more whitespaces
    • (\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b) - Group 2:
      • \d+ - one or more digits
      • (?:\s*[-|+\/]\s*\d+)* - zero or more sequences of zero or more whitespaces, -, +, | or /, zero or more whitespaces, one or more digits
      • \s* - zero or more whitespaces
      • [a-z]?\b - an optional lowercase ASCII letter and a word boundary
  • (?:\s+(\d{4,5})(?:\s+(.*))?)? - an optional non-capturing group matching
    • \s+ - one or more whitespaces
    • (\d{4,5}) - Group 3: four or five digits
    • (?:\s+(.*))? - an optional sequence of one or more whitespaces and then any zero or more chars other than line break chars as many as possible
  • $ - end of string.
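
The breakdown above can be tried out with a short Python sketch. It rests on a few assumptions: Python's re module has no \h, so [ \t] stands in for horizontal whitespace; the i flag from the question is mapped to re.IGNORECASE; and split_address and the sample addresses are hypothetical, made up for illustration.

```python
import re

# Pattern from the answer, copied verbatim.
ANSWER_PATTERN = r"^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$"

# Assumption: re has no \h, so every \s is swapped for [ \t]
# to match horizontal whitespace only, as the demo does with \h.
ADDRESS_RE = re.compile(ANSWER_PATTERN.replace(r"\s", r"[ \t]"), re.IGNORECASE)


def split_address(line):
    """Return (street, number, zip_code, city); parts that are absent are None."""
    match = ADDRESS_RE.match(line)
    return match.groups() if match else None


# Hypothetical sample addresses, not taken from the original list.
samples = [
    "Hauptstraße 12 10115 Berlin",
    "Am Markt 7a 8000 Zürich",
    "Lindenweg 3-5",
    "Bahnhofstraße",
]

for sample in samples:
    print(split_address(sample))
```

Matching line by line keeps the lazy (.*?) from having to deal with line breaks. For a whole file, the same function can be applied to each entry of text.splitlines().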

Please note that the (?:\s+(.*))? optional group must be inside the (?:\s+(\d{4,5})...)? group to work.
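
To illustrate why the nesting matters, here is a small Python sketch comparing the nested pattern with a flat variant in which the city group is optional on its own. The flat variant and the sample address are hypothetical and only serve as a contrast.

```python
import re

# Optional street-number part, shared by both variants (taken from the answer).
NUMBER = r"(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?"

# Nested, as in the answer: the city can only appear after a zip code.
nested = re.compile(
    r"^(.*?)" + NUMBER + r"(?:\s+(\d{4,5})(?:\s+(.*))?)?$", re.IGNORECASE
)

# Flat, hypothetical variant: the city group is optional independently.
flat = re.compile(
    r"^(.*?)" + NUMBER + r"(?:\s+(\d{4,5}))?(?:\s+(.*))?$", re.IGNORECASE
)

# Hypothetical address without zip code and city.
sample = "Am Markt 7a"

nested_groups = nested.match(sample).groups()
flat_groups = flat.match(sample).groups()
print(nested_groups)
print(flat_groups)
```

In the flat variant the lazy first group stops at "Am", because the free-standing (?:\s+(.*))? is allowed to take "Markt 7a" as the city. In the nested pattern that shortcut is not available, so the first group grows to "Am Markt" and "7a" lands in the number group.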

Upvotes: 2
