Reputation: 1051
We have a "street_number" field that has been filled in freely over the years and that we now want to normalize. Using regular expressions, we'd like to extract the real "street_number" and the "street_number_suffix".
Ex: for 17 b, "street_number" would be 17 and "street_number_suffix" would be b.
As there are a dozen different patterns, I'm having trouble tuning the regular expression correctly. I'm considering using 2 different regexes: one to extract the "street_number", and another to extract the "street_number_suffix".
Here's an exhaustive set of patterns we'd like to format and the expected output:
# Extract street_number using PCRE
input    street_number    street_number_suffix
19-21    19               null
2 G      2                G
A        null             A
1 bis    1                bis
3 C      3                C
N°10     10               null
17 b     17               b
76 B     76               B
7 ter    7                ter
9/11     9                null
21.3     21               3
42       42               null
I know I could use an expression that matches any digits up to a hyphen using \d+(?=\-). It could be extended to match up to a hyphen OR a slash using \d+(?=\-|\/), though once I include \s in this pattern, 21 from 19-21 will match. Adding conditions may not be that simple, which is why I'm asking for your help.
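To illustrate why the lookahead alone is not enough, here is a minimal sketch in Python. It assumes Python's re module behaves like PCRE for this simple pattern, and the helper name first_match is purely illustrative. It shows that \d+(?=-|/) only finds a number when a hyphen or slash follows it, so inputs like 42 or 17 b yield no match at all.

```python
import re

# Lookahead-only pattern from the question: digits followed by "-" or "/".
LOOKAHEAD = re.compile(r"\d+(?=-|/)")


def first_match(pattern, text):
    """Return the first full match of pattern in text, or None (hypothetical helper)."""
    m = pattern.search(text)
    return m.group(0) if m else None


# A few inputs taken from the table in the question.
for sample in ["19-21", "9/11", "42", "17 b"]:
    print(repr(sample), "->", first_match(LOOKAHEAD, sample))
```

The first two samples give 19 and 9, while 42 and 17 b give None, which is why a condition on what precedes the digits is more useful here than a condition on what follows them.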
Could anyone give me a hand with this? If it helps, here's a draft: https://regex101.com/r/jGK5Sa/4
Edit: at the time I'm editing, here's the closest regex I could find:
(?:(N°|(?<!\-|\/|\.|[a-z]|.{1})))\d+
Though the full match of N°10 isn't 10 but N°10 (and our ETL doesn't support capturing groups, so I can't use /......(\d+)/).
Upvotes: 0
Views: 40
Reputation: 163287
To get the street numbers, you could update the pattern to:
(?<![-/.a-z\d])\d+
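As a quick sanity check, the pattern can be run against the inputs listed in the question. This is only a sketch in Python: it assumes the ETL keeps the first full match, and that Python's re module treats this fixed-width lookbehind the same way PCRE does. The names STREET_NUMBER, EXPECTED and street_number are illustrative, not part of any ETL API.

```python
import re

# Pattern from this answer: digits not preceded by -, /, ., a lowercase letter or a digit.
STREET_NUMBER = re.compile(r"(?<![-/.a-z\d])\d+")

# Inputs and expected street_number values copied from the question's table.
EXPECTED = {
    "19-21": "19",
    "2 G": "2",
    "A": None,
    "1 bis": "1",
    "3 C": "3",
    "N°10": "10",
    "17 b": "17",
    "76 B": "76",
    "7 ter": "7",
    "9/11": "9",
    "21.3": "21",
    "42": "42",
}


def street_number(raw):
    """Return the first full match, or None when the input has no leading number."""
    m = STREET_NUMBER.search(raw)
    return m.group(0) if m else None


for raw, want in EXPECTED.items():
    got = street_number(raw)
    status = "ok" if got == want else "MISMATCH"
    print(f"{raw!r:10} -> {got}  ({status})")
```

Because only the full match is used, no capturing group is needed, which fits the ETL constraint mentioned in the question.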
Explanation
(?<!  Negative lookbehind, assert that what is directly to the left is not
[-/.a-z\d]  Match any of the listed characters using a character class
)  Close the negative lookbehind
\d+  Match 1+ digits
Upvotes: 2