Reputation: 97
The question is pretty straightforward: I want to process a string with Go's regexp package and break it apart into three substrings. I already have something that works, but it has some shortcomings. This is my regular expression:
str := "ARM64x99Bar"
pattern := `^(?!^\d+$)[a-zA-Z0-9]+(\d+)(.*)$`
r := regexp.MustCompile(pattern)
matches := r.FindStringSubmatch(str)
For example, Foo9Bar works and I can obtain the three desired substrings: Foo, 9, Bar. However, this breaks if there are double digits; for example, Foo99Bar returns Foo9, 9, Bar. Firstly, how can I improve upon this?
Secondly, it gets complicated when the string is, for example, ARM64x99Bar. In this case, I would like to obtain ARM64x, 99, Bar as well.
So, in summary: the first substring group can be alphanumeric or words but can never start with a digit; the second substring group will always be numeric (one or two digits, like 9 or 99); and the third substring group will always consist of English letters only.
Upvotes: 0
Views: 292
Reputation: 10919
^([a-zA-Z]|[a-zA-Z][a-zA-Z0-9]*[a-zA-Z])(\d+)([a-zA-Z]+)$
^ start
(
[a-zA-Z] a letter
| or
[a-zA-Z] starts with letter
[a-zA-Z0-9]* alphanumeric (\w would also match `_`)
[a-zA-Z] ends with letter
)
(\d+) digits
([a-zA-Z]+) letters
$ end
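As a minimal sketch of how this pattern might be used from Go (the names `partsRe` and `splitParts` are hypothetical, not from the answer): `FindStringSubmatch` returns the whole match at index 0, followed by the three groups.

```go
package main

import (
	"fmt"
	"regexp"
)

// partsRe is the pattern from the answer above: group 1 starts and ends
// with a letter, group 2 is a run of digits, group 3 is letters only.
var partsRe = regexp.MustCompile(`^([a-zA-Z]|[a-zA-Z][a-zA-Z0-9]*[a-zA-Z])(\d+)([a-zA-Z]+)$`)

// splitParts (hypothetical helper) returns the three captured substrings,
// or nil when the input does not match the pattern.
func splitParts(s string) []string {
	m := partsRe.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	return m[1:] // drop m[0], the full match
}

func main() {
	for _, s := range []string{"Foo9Bar", "Foo99Bar", "ARM64x99Bar"} {
		fmt.Println(s, "->", splitParts(s))
	}
}
```

Because group 1 has to end with a letter, it cannot absorb any of the digits that belong to group 2.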
Upvotes: 0
Reputation: 20909
So, in summary, the first substring group can be alphanumeric or words but can never start from numbers, the second substring group will always be numeric (one or more digits, max. double digits like 9 or 99), and the third substring group will always be english alphabets only.
Just design your pattern starting with the hardest constraints. In this case, it is easiest to work from right to left:
and the third substring group will always be english alphabets only
That is obviously easy: [a-zA-Z]+$
the second substring group will always be numeric (one or more digits, max. double digits like 9 or 99)
One or two digits in front of the first pattern: [\d]{1,2}[a-zA-Z]+$
the first substring group can be alphanumeric or words but can never start from numbers
Or in other words, the first group is "a letter, followed by (optional) letters or numbers": ^[a-zA-Z]{1}[a-zA-Z0-9]*
and put together with the first pattern: ^[a-zA-Z]{1}[a-zA-Z0-9]*[\d]{1,2}[a-zA-Z]+$
This should be pretty much what you need to MATCH - now add the match groups as required: ^([a-zA-Z]{1}[a-zA-Z0-9]*)([\d]{1,2})([a-zA-Z]+)$
The only case that is left unclear by your definition is, for example, test235nothing: do you want test23 5 nothing or test2 35 nothing?
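To see which of the two splits the pattern with match groups produces, here is a small sketch (the names `greedyRe` and `greedySplit` are made up for illustration, and the letter classes are written as `[a-zA-Z]`). Because the first group is greedy, it keeps as many digits as it can and leaves only one for the second group.

```go
package main

import (
	"fmt"
	"regexp"
)

// greedyRe: a leading letter, then greedy alphanumerics, then one or two
// digits, then letters up to the end of the string.
var greedyRe = regexp.MustCompile(`^([a-zA-Z][a-zA-Z0-9]*)(\d{1,2})([a-zA-Z]+)$`)

// greedySplit (hypothetical helper) returns the three groups, or nil if
// the input does not match.
func greedySplit(s string) []string {
	m := greedyRe.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	return m[1:]
}

func main() {
	fmt.Println(greedySplit("test235nothing")) // [test23 5 nothing]
	fmt.Println(greedySplit("Foo99Bar"))       // [Foo9 9 Bar]
}
```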
If the latter is preferred, you need to adjust the pattern slightly, because the regex engine generally works from left to right and the first match group therefore consumes as much as possible. A quite simple approach would be to "or" the pattern with itself and put the version that requires two digits for group 2 first (the ?: just tells the engine that the surrounding parentheses shouldn't become a match group of their own):
^(?:(?:([a-zA-Z]{1}[a-zA-Z0-9]*)([\d]{2})([a-zA-Z]+))|(?:([a-zA-Z]{1}[a-zA-Z0-9]*)([\d]{1})([a-zA-Z]+)))$
https://regex101.com/r/2evLbb/2
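In Go, an alternation like this yields six capture groups, and only one triple is filled for any given match. A rough sketch of reading the result follows. The names `altRe` and `splitPreferTwoDigits` are hypothetical, and the sketch assumes the intent is for `^` and `$` to apply to both branches, so the whole alternation sits inside one non-capturing group.

```go
package main

import (
	"fmt"
	"regexp"
)

// altRe tries the two-digit branch first (groups 1-3) and falls back to
// the one-digit branch (groups 4-6).
var altRe = regexp.MustCompile(
	`^(?:([a-zA-Z][a-zA-Z0-9]*)(\d{2})([a-zA-Z]+)|([a-zA-Z][a-zA-Z0-9]*)(\d)([a-zA-Z]+))$`)

// splitPreferTwoDigits (hypothetical helper) returns whichever triple of
// groups took part in the match, or nil if nothing matched.
func splitPreferTwoDigits(s string) []string {
	m := altRe.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	if m[2] != "" { // group 2 is non-empty only when the two-digit branch matched
		return m[1:4]
	}
	return m[4:7]
}

func main() {
	for _, s := range []string{"test235nothing", "Foo9Bar", "ARM64x99Bar"} {
		fmt.Println(s, "->", splitPreferTwoDigits(s))
	}
}
```

Groups that did not take part in the match come back from `FindStringSubmatch` as empty strings, which is what the `m[2] != ""` check relies on.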
Upvotes: 0
Reputation: 6213
^([a-zA-Z]+[0-9]*?[a-zA-Z]+)(\d+)([a-zA-Z]\w*)$
I'm working in a regex playground on my phone so this may work as intended, or not... 😅
As you can see, both Foo99Bar and ARM64x99Bar seem to be caught correctly.
Upvotes: 0
Reputation: 425288
First insert some commas between the terms, then split:
str := "ARM64x99Bar"
pattern := `([a-zA-Z])(\d)|(\d)([a-zA-Z])`
r := regexp.MustCompile(pattern)
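// Run the replacement twice: a character consumed as the end of one match
// cannot also start the next match within a single pass.
str = r.ReplaceAllString(str, `$1$3,$2$4`)
str = r.ReplaceAllString(str, `$1$3,$2$4`)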
matches := strings.Split(str, `,`)
fmt.Println(matches)
See live demo.
To only split between uppercase letters and digits, remove lowercase from the regex:
pattern := `([A-Z])(\d)|(\d)([A-Z])`
The replacement matches a letter followed by a digit (or vice versa), capturing each character in its own group, then puts them back with a comma in between (ready to split). Because it's an alternation, only one side will match: either groups 1 and 2 will contain characters and groups 3 and 4 will be blank, or vice versa. Thus the replacement $1$3,$2$4 will always be 3 characters long.
Note that due to Go's lack of lookaround support, you have to execute the replace twice in case the character that ends one match is also needed to start the next.
Upvotes: 0