regex go regex-lookarounds regex-group regex-greedy

Reputation: 97

How to match these string variations in regex?

The question is pretty straightforward: I want to process a string using the regex compiler in Go and break it apart into three substrings. I already have something that works, but sadly it has some shortcomings. This is my regular expression:

str := "ARM64x99Bar"
pattern := `^(?!^\d+$)[a-zA-Z0-9]+(\d+)(.*)$`

r := regexp.MustCompile(pattern)
matches := r.FindStringSubmatch(str)

For example, Foo9Bar works and I can obtain the three desired substrings: Foo, 9, Bar. However, this breaks if there are double digits: Foo99Bar returns Foo9, 9, Bar. Firstly, how can I improve upon this?

Secondly, it gets complicated when the string is, for example, ARM64x99Bar. In this case, I would like to obtain ARM64x, 99, Bar as well.

So, in summary, the first substring group can be alphanumeric or words but can never start from numbers, the second substring group will always be numeric (one or more digits, max. double digits like 9 or 99), and the third substring group will always be english alphabets only.

Upvotes: 0

Answers (4)

Dimava

Reputation: 10919

^([a-zA-Z]|[a-zA-Z][a-zA-Z0-9]*[a-zA-Z])(\d+)([a-zA-Z]+)$

^               start
(
  [a-zA-Z]      a letter
  |             or
  [a-zA-Z]      starts with letter
  [a-zA-Z0-9]*  alphanumeric (\w is also `_`)
  [a-zA-Z]      ends with letter
)
(\d+)           digits
([a-zA-Z]+)     letters
$               end
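As a minimal sketch of how this pattern could be used from Go (the helper name splitParts is hypothetical, not part of the answer), the following program compiles it once and prints the three captured groups for the strings from the question:

```go
package main

import (
	"fmt"
	"regexp"
)

// splitRe is the pattern from this answer, compiled once at start-up.
var splitRe = regexp.MustCompile(`^([a-zA-Z]|[a-zA-Z][a-zA-Z0-9]*[a-zA-Z])(\d+)([a-zA-Z]+)$`)

// splitParts is a hypothetical helper: it returns the three captured
// substrings, or nil when the input does not match the pattern.
func splitParts(s string) []string {
	m := splitRe.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	return m[1:] // m[0] is the whole match
}

func main() {
	for _, s := range []string{"Foo9Bar", "Foo99Bar", "ARM64x99Bar"} {
		fmt.Println(s, "->", splitParts(s))
	}
}
```

Because the first group has to end in a letter, it cannot swallow any of the trailing digits, so Foo99Bar splits into Foo, 99, Bar and ARM64x99Bar into ARM64x, 99, Bar.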

Upvotes: 0

dognose

Reputation: 20909

So, in summary, the first substring group can be alphanumeric or words but can never start from numbers, the second substring group will always be numeric (one or more digits, max. double digits like 9 or 99), and the third substring group will always be english alphabets only.

Just design your pattern starting with the hardest constraints. In this case, it is easiest to work from right to left:

and the third substring group will always be english alphabets only

That is obviously easy: [a-zA-Z]+$

the second substring group will always be numeric (one or more digits, max. double digits like 9 or 99)

One or two digits in front of the first pattern: [\d]{1,2}[a-zA-Z]+$

the first substring group can be alphanumeric or words but can never start from numbers

Or in other words, the first group is "a letter, followed by (optional) letters or digits": ^[a-zA-Z]{1}[a-zA-Z0-9]*. Put together with the first pattern: ^[a-zA-Z]{1}[a-zA-Z0-9]*[\d]{1,2}[a-zA-Z]+$

This should be pretty much what you need to MATCH - now add the match groups as required: ^([a-zA-Z]{1}[a-zA-Z0-9]*)([\d]{1,2})([a-zA-Z]+)$

The only case that is left unclear by your definition is now for example:

test235nothing - do you want test23 5 nothing or test2 35 nothing?
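To see which of the two readings the three-group pattern picks in Go, here is a small sketch. It writes the pattern with [a-zA-Z] and \d and drops the redundant {1}; the helper name greedySplit is made up for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// greedyRe is the three-group pattern with a greedy first group and
// one or two digits in the second group.
var greedyRe = regexp.MustCompile(`^([a-zA-Z][a-zA-Z0-9]*)(\d{1,2})([a-zA-Z]+)$`)

// greedySplit (hypothetical helper) returns the three groups or nil.
func greedySplit(s string) []string {
	m := greedyRe.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	return m[1:]
}

func main() {
	// The greedy first group keeps as many characters as it can and
	// leaves only a single digit for the second group.
	fmt.Println(greedySplit("test235nothing"))
	fmt.Println(greedySplit("Foo99Bar"))
}
```

With the greedy first group, test235nothing comes back as test23, 5, nothing, and Foo99Bar as Foo9, 9, Bar, so this form still shows the double-digit problem described in the question.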

If the latter is preferred, you need to slightly adjust the pattern, as a regex generally works from left to right and therefore consumes as much as possible with the first match group. A quite simple approach would be to "or" that pattern and put the version with exactly two digits for group 2 first (the ?: just tells the engine that the surrounding parentheses shouldn't become a match group on their own). Note that the results then land in groups 1 to 3 or in groups 4 to 6, depending on which alternative matched:

^(?:([a-zA-Z]{1}[a-zA-Z0-9]*)([\d]{2})([a-zA-Z]+)|([a-zA-Z]{1}[a-zA-Z0-9]*)([\d]{1})([a-zA-Z]+))$

https://regex101.com/r/2evLbb/2
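A sketch of the two-digits-first alternation in Go follows. It assumes both alternatives sit inside one non-capturing group so that ^ and $ apply to each of them, and the helper name twoDigitFirst is hypothetical. Since each alternative has its own capture groups, the helper returns whichever set was filled:

```go
package main

import (
	"fmt"
	"regexp"
)

// altRe tries the two-digit version first and falls back to one digit.
// Groups 1-3 belong to the first alternative, groups 4-6 to the second.
var altRe = regexp.MustCompile(
	`^(?:([a-zA-Z][a-zA-Z0-9]*)(\d{2})([a-zA-Z]+)|([a-zA-Z][a-zA-Z0-9]*)(\d)([a-zA-Z]+))$`)

// twoDigitFirst (hypothetical helper) returns the three substrings from
// whichever alternative matched, or nil when nothing matched.
func twoDigitFirst(s string) []string {
	m := altRe.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	if m[1] != "" { // group 1 is never empty when the first alternative matched
		return m[1:4]
	}
	return m[4:7]
}

func main() {
	for _, s := range []string{"test235nothing", "ARM64x99Bar", "Foo9Bar"} {
		fmt.Println(s, "->", twoDigitFirst(s))
	}
}
```

Here test235nothing yields test2, 35, nothing and ARM64x99Bar yields ARM64x, 99, Bar, while Foo9Bar falls through to the one-digit alternative and yields Foo, 9, Bar.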

Upvotes: 0

dda

Reputation: 6213

^([a-zA-Z]+[0-9]*?[a-zA-Z]+)(\d++)([a-zA-Z]\w*)$

I'm working in a regex playground on my phone so this may work as intended, or not... 😅

As you can see both Foo99Bar and ARM64x99Bar seem to be caught correctly.
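One caveat for Go specifically: the regexp package documents possessive quantifiers such as \d++ as not supported, so the pattern as posted is rejected at compile time. The sketch below checks that and tries a variant with a plain \d+ instead. The variant is my substitution, not the answer's pattern, and the names compiles and plainSplit are hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	// possessive is the pattern as posted in this answer.
	possessive = `^([a-zA-Z]+[0-9]*?[a-zA-Z]+)(\d++)([a-zA-Z]\w*)$`
	// plain swaps the possessive \d++ for an ordinary greedy \d+.
	plain = `^([a-zA-Z]+[0-9]*?[a-zA-Z]+)(\d+)([a-zA-Z]\w*)$`
)

// compiles reports whether Go's regexp package accepts the pattern.
func compiles(p string) bool {
	_, err := regexp.Compile(p)
	return err == nil
}

// plainSplit returns the three groups of the \d+ variant, or nil.
func plainSplit(s string) []string {
	m := regexp.MustCompile(plain).FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	return m[1:]
}

func main() {
	fmt.Println("possessive compiles:", compiles(possessive))
	fmt.Println("plain compiles:", compiles(plain))
	fmt.Println(plainSplit("Foo99Bar"))
	fmt.Println(plainSplit("ARM64x99Bar"))
}
```

Because the first group ends in a letter and the third starts with one, the digit run in between is consumed whole either way, so dropping the possessive modifier should not change the result for these inputs. The first group needs at least two letters, though, so a single-letter prefix such as A9B does not match.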

Upvotes: 0

Bohemian

Reputation: 425288

First insert some commas between the terms, then split:

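str := "ARM64x99Bar"
pattern := `([a-zA-Z])(\d)|(\d)([a-zA-Z])`

r := regexp.MustCompile(pattern)
// Insert a comma at every letter/digit boundary. The second pass catches
// boundaries whose first character was consumed by a match in the first pass.
str = r.ReplaceAllString(str, `$1$3,$2$4`)
str = r.ReplaceAllString(str, `$1$3,$2$4`)
matches := strings.Split(str, `,`) // requires importing "strings"
fmt.Println(matches)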

See live demo.

To only split between uppercase letters and digits, remove lowercase from the regex:

pattern := `([A-Z])(\d)|(\d)([A-Z])`

The replacement matches a letter followed by a digit (or vice versa), capturing each character in its own group, then puts them back with a comma in between (ready to split). Because the pattern is an alternation, only one side matches: either groups 1 and 2 contain characters and groups 3 and 4 are empty, or vice versa. Thus the replacement $1$3,$2$4 is always 3 characters long.

Note that because Go's regexp package lacks lookaround support, you have to run the replacement twice in case the last character of one match is needed as the first character of the next.
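To illustrate why a second pass can be needed, here is a small sketch (the helper name insertCommas and the sample input a1b are made up for illustration). In a1b the digit 1 is consumed by the match a1, so the boundary between 1 and b is only found on the next pass:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// boundaryRe matches a letter followed by a digit, or a digit followed
// by a letter, capturing each character separately.
var boundaryRe = regexp.MustCompile(`([a-zA-Z])(\d)|(\d)([a-zA-Z])`)

// insertCommas (hypothetical helper) applies the replacement the given
// number of times and returns the resulting string.
func insertCommas(s string, passes int) string {
	for i := 0; i < passes; i++ {
		s = boundaryRe.ReplaceAllString(s, `$1$3,$2$4`)
	}
	return s
}

func main() {
	fmt.Println(insertCommas("a1b", 1)) // one pass misses the 1/b boundary
	fmt.Println(insertCommas("a1b", 2)) // the second pass picks it up
	fmt.Println(strings.Split(insertCommas("ARM64x99Bar", 2), ","))
}
```

One pass over a1b gives a,1b and two passes give a,1,b. For ARM64x99Bar the two-pass split produces five pieces (ARM, 64, x, 99, Bar), since every letter/digit boundary is split, not only the last one.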

Upvotes: 0
