ReignBough

Reputation: 51

RegEx for the following string

I am trying to create a regex for an ID with the following rules:

  1. Starts with A-Z, one or more times. (Main ID, mi)
  2. Followed with an optional dash. (delimiter)
  3. Followed with 0-9, one or more times. (Sub ID, si)
  4. Followed with an optional dash or dot. (delimiter)
  5. Followed with an optional a-z or 0-9, one or more times. (Main category, mc)
  6. Followed with an optional dash or dot. (delimiter)
  7. Followed with an optional a-z or 0-9, one or more times. (Sub category, sc)

The delimiters can be omitted if the ID alternates between alpha and numeric parts (A-01a1, A1.a.1). A delimiter is required if two adjacent parts are both alpha or both numeric (A-1.1a, A1.2.3, A1a.a).

Here is what I have:

(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)[\-\.]?(?P<mc>[a-z0-9])*[\-\.]?(?P<sc>[a-z0-9])*

Here is the result when I tried it (rows marked <<<<< show the values I expected instead):

ID      mi  si  mc  sc
A1      A   1
A001    A   001
AB-01   AB  01
A1aa    A   1   a      <<<<< mc=aa
A-01a1  A   01  1      <<<<< mc=a sc=1
A-1.1a  A   1   a      <<<<< mc=1 sc=a
A1.a1   A   1   1      <<<<< mc=a sc=1
A1.a.1  A   1   a   1
A1.2.3  A   1   2   3
A1a.a   A   1   a   a

Upvotes: 1

Views: 158

Answers (3)

Ro Yo Mi

Reputation: 15010

The * in your expression should be moved inside your capture groups. With the * outside, the group is repeated and only the last character it matched is kept.

You can also remove the backslashes inside the character classes.

(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)[\-\.]?(?P<mc>[a-z0-9])*[\-\.]?(?P<sc>[a-z0-9])*
                               ^ ^                   ^ ^ ^                   ^ 

Should look like:

(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)[-.]?(?P<mc>[a-z0-9]*)[-.]?(?P<sc>[a-z0-9]*)
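
As a quick check, here is a minimal sketch that contrasts the two patterns. It assumes Python's `re` module (the `(?P<name>...)` syntax in the question suggests Python or PCRE), and `split_id` is a hypothetical helper name used only for illustration. Note that the corrected pattern still puts `a1` into mc for A-01a1, because `[a-z0-9]*` is greedy and does not split on alpha/numeric alternation.

```python
import re

# Pattern from the question: the * sits outside the mc/sc groups,
# so each group keeps only the last character it matched.
ORIGINAL = re.compile(
    r"(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)[\-\.]?(?P<mc>[a-z0-9])*[\-\.]?(?P<sc>[a-z0-9])*"
)

# Corrected pattern: the * sits inside the groups.
FIXED = re.compile(
    r"(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)[-.]?(?P<mc>[a-z0-9]*)[-.]?(?P<sc>[a-z0-9]*)"
)


def split_id(pattern, text):
    """Return (mi, si, mc, sc) for text, or None if the pattern does not match."""
    m = pattern.match(text)
    return m.group("mi", "si", "mc", "sc") if m else None


print(split_id(ORIGINAL, "A1aa"))   # ('A', '1', 'a', None)
print(split_id(FIXED, "A1aa"))      # ('A', '1', 'aa', '')
print(split_id(FIXED, "A-01a1"))    # ('A', '01', 'a1', '')
```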

Upvotes: 0

Ro Yo Mi

Reputation: 15010

Description

^(?P<MainID>[a-z]+)-?(?<SubID>[0-9]+)(?:[-.]?(?P<MainCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+))?(?:[-.]?(?P<SubCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+))?

The regex does the following:

  • Starts with A-Z, one or more times. (Main ID, mi)
  • Followed with an optional dash. (delimiter)
  • Followed with 0-9, one or more times. (Sub ID, si)
  • Followed with an optional dash or dot. (delimiter)
  • Followed with an optional a-z or 0-9, one or more times. (Main category, mc)
  • Followed with an optional dash or dot. (delimiter)
  • Followed with an optional a-z or 0-9, one or more times. (Sub category, sc)

  • If a category is preceded by a delimiter and followed by a delimiter or whitespace (such as the end of a line), it may mix letters and numbers within the same capture group

  • Otherwise, the capture group takes only letters or only numbers
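
To try this outside regex101, here is a rough Python port, offered as a sketch rather than a definitive version. It assumes the case-insensitive and multiline flags implied by the explanation further down, and it rewrites `(?<SubID>...)` as `(?P<SubID>...)` because Python's `re` module uses the `(?P<name>...)` spelling. `PATTERN`, `SAMPLE` and `parse` are names made up for this example.

```python
import re

# Port of the pattern above; only the SubID group syntax was changed.
PATTERN = re.compile(
    r"^(?P<MainID>[a-z]+)-?(?P<SubID>[0-9]+)"
    r"(?:[-.]?(?P<MainCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+))?"
    r"(?:[-.]?(?P<SubCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+))?",
    re.IGNORECASE | re.MULTILINE,
)

# One ID per line, each followed by a newline (which counts as \s).
SAMPLE = "A1aa\nA-01a1\nA-1.1a\nA1.a.1\n"


def parse(text):
    """Return a (MainID, SubID, MainCategory, SubCategory) tuple per match."""
    names = ("MainID", "SubID", "MainCategory", "SubCategory")
    return [m.group(*names) for m in PATTERN.finditer(text)]


for row in parse(SAMPLE):
    print(row)
# ('A', '1', 'aa', None)
# ('A', '01', 'a', '1')
# ('A', '1', '1a', None)
# ('A', '1', 'a', '1')

# Without trailing whitespace the lookahead (?=[-.\s]) cannot succeed,
# so the mixed alternative is skipped and the category is split instead.
print(PATTERN.match("A-1.1a").group("MainCategory", "SubCategory"))  # ('1', 'a')
```

The last line shows a subtlety: the lookahead needs an actual delimiter or whitespace character after the category, so an ID at the very end of the input (no trailing newline) is split differently from the same ID followed by a space or newline.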

Example

Live Demo

https://regex101.com/r/uH7zF3/1

Sample text

ID      mi  si  mc  sc
A1      A   1
A001    A   001
AB-01   AB  01
A1aa    A   1   a      <<<<< mc=aa
A-01a1  A   01  1      <<<<< mc=a sc=1
A-1.1a  A   1   a      <<<<< mc=1 sc=a
A1.a1   A   1   1      <<<<< mc=a sc=1
A1.a.1  A   1   a   1
A1.2.3  A   1   2   3
A1a.a   A   1   a   a

Sample Matches

MATCH 1
MainID  [24-25] `A`
SubID   [25-26] `1`

MATCH 2
MainID  [38-39] `A`
SubID   [39-42] `001`

MATCH 3
MainID  [54-56] `AB`
SubID   [57-59] `01`

MATCH 4
MainID  [69-70] `A`
SubID   [70-71] `1`
MainCategory    [71-73] `aa`

MATCH 5
MainID  [104-105]   `A`
SubID   [106-108]   `01`
MainCategory    [108-109]   `a`
SubCategory [109-110]   `1`

MATCH 6
MainID  [143-144]   `A`
SubID   [145-146]   `1`
MainCategory    [147-149]   `1a`

MATCH 7
MainID  [182-183]   `A`
SubID   [183-184]   `1`
MainCategory    [185-187]   `a1`

MATCH 8
MainID  [221-222]   `A`
SubID   [222-223]   `1`
MainCategory    [224-225]   `a`
SubCategory [226-227]   `1`

MATCH 9
MainID  [243-244]   `A`
SubID   [244-245]   `1`
MainCategory    [246-247]   `2`
SubCategory [248-249]   `3`

MATCH 10
MainID  [265-266]   `A`
SubID   [266-267]   `1`
MainCategory    [267-268]   `a`
SubCategory [269-270]   `a`
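
Matches 6 and 7 capture `1a` and `a1` as a single main category, while the question expected them to be split (mc=1 sc=a and mc=a sc=1). If the split is what you want, one possible variant drops the lookaround alternative so that each category is letters only or digits only. This is a sketch under assumptions, not a tested drop-in: it assumes Python's `re`, a whole-string match, upper-case main IDs and lower-case categories as in the question. `STRICT`, `EXPECTED` and `split_strict` are hypothetical names.

```python
import re

# Each category is either all letters or all digits, so a category ends
# where the character type changes or at a delimiter.
STRICT = re.compile(
    r"(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)"
    r"(?:[-.]?(?P<mc>[a-z]+|[0-9]+))?"
    r"(?:[-.]?(?P<sc>[a-z]+|[0-9]+))?"
)

# Expected splits taken from the table in the question (None = absent).
EXPECTED = {
    "A1": ("A", "1", None, None),
    "A001": ("A", "001", None, None),
    "AB-01": ("AB", "01", None, None),
    "A1aa": ("A", "1", "aa", None),
    "A-01a1": ("A", "01", "a", "1"),
    "A-1.1a": ("A", "1", "1", "a"),
    "A1.a1": ("A", "1", "a", "1"),
    "A1.a.1": ("A", "1", "a", "1"),
    "A1.2.3": ("A", "1", "2", "3"),
    "A1a.a": ("A", "1", "a", "a"),
}


def split_strict(text):
    """Return (mi, si, mc, sc), or None if the whole string does not match."""
    m = STRICT.fullmatch(text)
    return m.group("mi", "si", "mc", "sc") if m else None


for text, want in EXPECTED.items():
    got = split_strict(text)
    print(f"{text:8} {got} {'ok' if got == want else 'MISMATCH'}")
```

The trade-off is that a mixed category such as `a1` between two delimiters is no longer kept together, which the pattern above deliberately allows.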

Explanation

^ assert position at start of a line
(?P<MainID>[a-z]+) Named capturing group MainID
  [a-z]+ match a single character present in the list below
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    a-z a single character in the range between a and z (case insensitive)
-? matches the character - literally
  Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
(?<SubID>[0-9]+) Named capturing group SubID
  [0-9]+ match a single character present in the list below
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    0-9 a single character in the range between 0 and 9
(?:[-.]?(?P<MainCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+))? Non-capturing group
  Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
  [-.]? match a single character present in the list below
    Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
    -. a single character in the list -. literally
  (?P<MainCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+) Named capturing group MainCategory
    1st Alternative: (?<=[-.])[a-z0-9]+(?=[-.\s])
      (?<=[-.]) Positive Lookbehind - Assert that the regex below can be matched
        [-.] match a single character present in the list below
          -. a single character in the list -. literally
      [a-z0-9]+ match a single character present in the list below
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
        a-z a single character in the range between a and z (case insensitive)
        0-9 a single character in the range between 0 and 9
      (?=[-.\s]) Positive Lookahead - Assert that the regex below can be matched
        [-.\s] match a single character present in the list below
          -. a single character in the list -. literally
          \s match any white space character [\r\n\t\f ]
    2nd Alternative: [a-z]+
      [a-z]+ match a single character present in the list below
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
        a-z a single character in the range between a and z (case insensitive)
    3rd Alternative: [0-9]+
      [0-9]+ match a single character present in the list below
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
        0-9 a single character in the range between 0 and 9
(?:[-.]?(?P<SubCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+))? Non-capturing group
  Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
  [-.]? match a single character present in the list below
    Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
    -. a single character in the list -. literally
  (?P<SubCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+) Named capturing group SubCategory
    1st Alternative: (?<=[-.])[a-z0-9]+(?=[-.\s])
      (?<=[-.]) Positive Lookbehind - Assert that the regex below can be matched
        [-.] match a single character present in the list below
          -. a single character in the list -. literally
      [a-z0-9]+ match a single character present in the list below
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
        a-z a single character in the range between a and z (case insensitive)
        0-9 a single character in the range between 0 and 9
      (?=[-.\s]) Positive Lookahead - Assert that the regex below can be matched
        [-.\s] match a single character present in the list below
          -. a single character in the list -. literally
          \s match any white space character [\r\n\t\f ]
    2nd Alternative: [a-z]+
      [a-z]+ match a single character present in the list below
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
        a-z a single character in the range between a and z (case insensitive)
    3rd Alternative: [0-9]+
      [0-9]+ match a single character present in the list below
        Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
        0-9 a single character in the range between 0 and 9

Upvotes: 2

Kirill Polishchuk

Reputation: 56212

I would use this one:

(?<mi>[A-Z]+)-?(?<si>[0-9]+)[-.]?(?<mc>[a-z0-9]*)[-.]?(?<sc>[a-z0-9]*)

Upvotes: 0
