nevets1219
nevets1219

Reputation: 7706

Scala Regular Expression Oddity

I have this regular expression:

^(10)(1|0)(.)(.)(.)(.{18})((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+AY([0-9]*)AZ(.*)$

To give it a bit of organization, there's really 3 parts:

// Part 1
^(10)(1|0)(.)(.)(.)(.{18})

// Part 2
// Optional Elements that begin with two characters and is terminated by a |
// May appear at most once
((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+

// Part 3
AY([0-9]*)AZ(.*)$

Part 2 is the part that I'm having trouble with but I believe the current regular expression says any of these given elements will appear one or more times. I could have done something like: (AB.*?|) but I don't need the pipe in my group and wasn't quite sure how to express it.

This is my sample input - it's SIP2 if you've seen it before (please disregard checksum, I know it's not valid):

101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71

This is my snippet of Scala code:

val regex = """^(10)(1|0)(.)(.)(.)(.{18})((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+AY([0-9]*)AZ(.*)$""".r
val msg = "101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71"
val m = regex.findFirstMatchIn(msg)) match {
  case None => println("No match")
  case Some(x) =>
    for (i <- 0 to x.groupCount) {
      println(i + " " + x.group(i))
    }
}

This is my output:

0 101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71
1 10
2 1
3 Y
4 N
5 Y
6 201406120000091911
7 DAl|
8 ABb
9 AQc
10 AJd
11 AFf
12 CSg
13 CRh
14 CTi
15 CKe
16 CVj
17 CYk
18 DAl
19 AOa
20 1
21 AA71

Note the entry that starts with 7. Can anyone explain why that's there?

I'm using Scala 2.10.4 but I believe regular expressions in Scala simply uses Java's regular expression. I'm certainly open to other suggestions for parsing strings.

EDIT: Based on wingedsubmariner's response, I was able to fix my regular expression:

^(10)(1|0)(.)(.)(.)(.{18})(?:AB([^|]*)\||AQ([^|]*)\||AJ([^|]*)\||AF([^|]*)\||CS([^|]*)\||CR([^|]*)\||CT([^|]*)\||CK([^|]*)\||CV([^|]*)\||CY([^|]*)\||DA([^|]*)\||AO([^|]*)\|)+AY([0-9]*)AZ(.*)$

Basically adding ?: to indicate I was not interested in the group!

Upvotes: 1

Views: 74

Answers (1)

wingedsubmariner
wingedsubmariner

Reputation: 13667

You get a matched group for each set of parentheses, the order being the order of the opening parenthesis in the regex. Matched group 7 corresponds to the opening parenthesis that begins your "Group 2":

((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+
^
|
This parenthesis

Each matched group takes on the value of the last part of the text that matched, which in this case is DAl| because it was the last piece of text to match the "Group 2" expression.

Here is a simpler example that demonstrates the behavior:

val regex = """((A)\||(B)\|)+""".r
val msg = "A|B|A|B|"
regex.findFirstMatchIn(msg) match {
  case None => println("No match")
  case Some(x) =>
    for (i <- 0 to x.groupCount) {
      println(i + " " + x.group(i))
    }
}

Which produces:

0 A|B|A|B|
1 B|
2 A
3 B

Upvotes: 2

Related Questions