Reputation: 43
I've been tearing my hair out for the last two hours with this and can't help feeling there's a simple solution that I'm not seeing. I am trying to process a string - a house number (as you would find in a street address) and break it up into four component parts.
The string can take one of four basic patterns:
A. a numeric value consisting of one or more digits e.g. 5
B. one or more digits followed by a single alphabetic character e.g. 5A
C. two numeric values consisting of one or more digits and joined by a hyphen e.g. 5-6
D. two alphanumeric values (with each consisting of one or more digits followed by a single alphabetic character) split by a hyphen e.g. 5A-6B
The string should always start with a numeric character (1-9), but everything else is optional.
I need to end up with four values as follows:
startnumber - it would be 5 in the example above
startsuffix - it would be A in the example above
endnumber - it would be 6 in the example above
endsuffix - it would be B in the example above
startnumber and endnumber can be one or more digits. startsuffix and endsuffix must be a single alphabetic character.
I have some basic validation on my form that only allows 0-9, A-Z and the '-' character to be input.
I've been hacking around with lots of if statements, is_numeric, strpos and so on, but can't help feeling there's a more obvious answer, maybe using a regex, and I'm really struggling. Any help would be gratefully received.
Upvotes: 3
Views: 213
Reputation: 781935
I think this regexp should do it:
(\d+)([A-Z]?)(?:-(\d+)([A-Z]?))?
Capture groups 1 through 4 correspond to the four values you list.
This will also match house numbers like 5-6B. Plain regular expressions don't remember which optional parts matched, so it's not really feasible to require that there be a letter in the second part if and only if there's one in the first part, unless you use an alternation of 4 different regular expressions, one for each case.
With this regular expression, the calling code can simply check that, whenever capture group 3 (the end number) is present, capture groups 2 and 4 are both empty or both non-empty.
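To illustrate that check, here is a minimal sketch in Python. The function name `parse_house_number` is made up for this example, and `re.ASCII` is my addition so that `\d` only matches 0-9. The same pattern should carry over to PHP's `preg_match` if you wrap it in `^...$`. Note that the suffix check is only applied when an end number is present, otherwise a plain 5A would be rejected.

```python
import re

# Same pattern as above; fullmatch() makes it cover the whole string.
# re.ASCII (an assumption of this sketch) restricts \d to the digits 0-9.
HOUSE_NUMBER = re.compile(r"(\d+)([A-Z]?)(?:-(\d+)([A-Z]?))?", re.ASCII)


def parse_house_number(text):
    """Return (startnumber, startsuffix, endnumber, endsuffix) or None."""
    match = HOUSE_NUMBER.fullmatch(text)
    if match is None:
        return None
    # Groups that did not take part in the match come back as "".
    start_number, start_suffix, end_number, end_suffix = match.groups(default="")
    # For a range, require a suffix on both halves or on neither.
    if end_number and bool(start_suffix) != bool(end_suffix):
        return None
    return start_number, start_suffix, end_number, end_suffix


if __name__ == "__main__":
    for sample in ("5", "5A", "5-6", "5A-6B", "5-6B"):
        print(sample, parse_house_number(sample))
```

Returning `None` for both "no match" and "mismatched suffixes" keeps the caller simple. If you need to tell the two apart, raising separate errors would be one option.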
Upvotes: 4
Reputation: 3781
It's a hack, but it should work:
(?<startnumber>\d+)(?<startsuffix>[A-Z])?(?:-(?<endnumber>\d+)(?<endsuffix>[A-Z])?)?
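As a rough sketch of how named groups map onto the four fields, here it is in Python, where the syntax is `(?P<name>...)` rather than `(?<name>...)`. Each number group is kept separate from its suffix group so that startnumber holds only the digits. The function name `split_house_number` is hypothetical.

```python
import re

# Named groups, one per field; the suffix groups sit outside the number groups.
NAMED_HOUSE_NUMBER = re.compile(
    r"(?P<startnumber>[0-9]+)(?P<startsuffix>[A-Z])?"
    r"(?:-(?P<endnumber>[0-9]+)(?P<endsuffix>[A-Z])?)?"
)


def split_house_number(text):
    """Return a dict of the four fields, or None if the string does not match."""
    match = NAMED_HOUSE_NUMBER.fullmatch(text)
    if match is None:
        return None
    # Fields that were absent from the input are reported as empty strings.
    return match.groupdict(default="")


if __name__ == "__main__":
    print(split_house_number("12B-14"))
```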
Upvotes: 0
Reputation: 1403
You might try the following (this is in raw PCRE):
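^(?:([0-9]+)([A-Z])?|([0-9]+)-([0-9]+)|([0-9]+)([A-Z])-([0-9]+)([A-Z]))$
The anchors are needed so that the first alternative can't match just a prefix, such as the 5A in 5A-6B. The issue is that which capturing groups get filled depends on which alternative matches. If you're not concerned about validating the specific format, then you might try this: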
([0-9]+)([A-Z])?(?:-([0-9]+)([A-Z])?)?
in which case the first capturing group would hold the startnumber, the second, the startsuffix, the third, the endnumber, and the fourth, the endsuffix. Unlike my first example, it won't confirm that the input actually matches one of the formats you specified (i.e., it will accept 2D-4 or 2-4D), but if that's not an issue, then it's probably easier to use.
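If you do want the strict version, the varying group numbers can be folded back into the four values in the calling code. This is only a sketch in Python: `parse_strict` is a made-up name, and `fullmatch()` stands in for explicit `^...$` anchors. In PHP you would anchor the pattern and use `preg_match` instead.

```python
import re

# Strict alternation: one branch per accepted format.
STRICT_HOUSE_NUMBER = re.compile(
    r"([0-9]+)([A-Z])?"                    # formats A and B: groups 1-2
    r"|([0-9]+)-([0-9]+)"                  # format C: groups 3-4
    r"|([0-9]+)([A-Z])-([0-9]+)([A-Z])"    # format D: groups 5-8
)


def parse_strict(text):
    """Return (startnumber, startsuffix, endnumber, endsuffix) or None."""
    match = STRICT_HOUSE_NUMBER.fullmatch(text)
    if match is None:
        return None
    groups = match.groups(default="")
    if match.group(5) is not None:   # format D matched
        return groups[4], groups[5], groups[6], groups[7]
    if match.group(3) is not None:   # format C matched
        return groups[2], "", groups[3], ""
    return groups[0], groups[1], "", ""  # format A or B matched


if __name__ == "__main__":
    for sample in ("5", "5A", "5-6", "5A-6B", "2D-4", "2-4D"):
        print(sample, parse_strict(sample))
```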
Upvotes: 1