Sumak

Reputation: 1051

Match street number from different formats without suffixes

We have a "street_number" field that has been filled in freely over the years and that we now want to normalize. Using regular expressions, we'd like to extract the real "street_number" and the "street_number_suffix".

Ex: 17 b, "street_number" would be 17, and "street_number_suffix" would be b.

As there are a dozen or so different patterns, I'm having trouble tuning the regular expression correctly. I'm considering using two different regexes: one to extract the "street_number" and another to extract the "street_number_suffix".

Here's an exhaustive set of patterns we'd like to format and the expected output:

# Extract street_number using PCRE

input           street_number   street_number_suffix

19-21           19              null
2 G             2               G
A               null            A
1 bis           1               bis
3 C             3               C
N°10            10              null
17 b            17              b
76 B            76              B
7 ter           7               ter
9/11            9               null
21.3            21              3
42              42              null

I know I can match any digits up to a hyphen using \d+(?=\-). It can be extended to match up to a hyphen OR a slash using \d+(?=\-|\/), though once I add \s to this pattern, the 21 from 19-21 also matches. Adding conditions may not be that simple, which is why I'm asking for your help.
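To make the lookahead behaviour concrete, here is a minimal sketch using Python's re module rather than PCRE (the two behave the same for these patterns). It assumes the inputs are tested as newline-separated lines, as in a multi-line test string; the variable names are only illustrative.

```python
import re

# Assumed test text: two of the inputs from the table, one per line.
text = "19-21\n9/11"

# Digits followed by a hyphen or a slash: only the first number of each pair.
hyphen_or_slash = re.findall(r"\d+(?=-|/)", text)

# With \s added to the lookahead, the newline after "21" satisfies it too,
# so the second half of the range is picked up as well.
with_whitespace = re.findall(r"\d+(?=-|/|\s)", text)

print(hyphen_or_slash)   # ['19', '9']
print(with_whitespace)   # ['19', '21', '9']
```

A lookahead only constrains what follows the digits, so it cannot tell that 21 is the second half of a range; that takes a condition on what precedes the digits.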

Could anyone give me a hand with this? If it helps, here's a draft: https://regex101.com/r/jGK5Sa/4


Edit: at the time I'm editing, here's the closest regex I could find:

(?:(N°|(?<!\-|\/|\.|[a-z]|.{1})))\d+

Though the full match for N°10 isn't 10 but N°10 (and our ETL doesn't support capturing groups, so I can't use /......(\d+)/).

Upvotes: 0

Views: 40

Answers (1)

The fourth bird

Reputation: 163287

To get the street numbers, you could update the pattern to:

(?<![-/.a-z\d])\d+

Explanation

  • (?<! Negative lookbehind
    • [-/.a-z\d] Match any of the listed using a character class
  • ) Close the negative lookbehind
  • \d+ Match 1+ digits
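
As a quick check, the pattern can be run against the inputs from the question. This is a minimal sketch using Python's re module instead of PCRE (the single-character lookbehind works the same way in both); the helper name extract_street_number is hypothetical, and only the full match is used, since the ETL does not support capturing groups.

```python
import re

# Same pattern as above: digits not directly preceded by -, /, ., a-z or a digit.
STREET_NUMBER = re.compile(r"(?<![-/.a-z\d])\d+")

def extract_street_number(raw):
    """Return the first full match as a string, or None if there is none."""
    match = STREET_NUMBER.search(raw)
    return match.group(0) if match else None

# Inputs and expected street_number values taken from the question's table.
expected = {
    "19-21": "19", "2 G": "2", "A": None, "1 bis": "1",
    "3 C": "3", "N°10": "10", "17 b": "17", "76 B": "76",
    "7 ter": "7", "9/11": "9", "21.3": "21", "42": "42",
}

results = {raw: extract_street_number(raw) for raw in expected}
for raw, number in results.items():
    print(f"{raw!r:10} -> {number}")
```

The \d inside the lookbehind is what stops the engine from matching the trailing 1 of 21 in 19-21: once the 2 is rejected because of the hyphen, the 1 is rejected because a digit precedes it.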

Regex demo

Upvotes: 2
