Michael Lafayette

Reputation: 3072

Scala RegEx String extractors behaving inconsistently

I have two regular expression extractors.

One is for .java files and the other is for .scala files.

val JavaFileRegEx =
  """\S*
     \s+
     //
     \s{1}
     ([^\.java]+)
     \.java
  """.replaceAll("(\\s)", "").r

val ScalaFileRegEx =
  """\S*
     \s+
     //
     \s{1}
     ([^\.scala]+)
     \.scala
  """.replaceAll("(\\s)", "").r

I want to use these extractors above to extract a java file name and a scala file name from the example code below.

val string1 = " // Tester.java"
val string2 = " // Hello.scala"

string1 match {
  case JavaFileRegEx(fileName1) => println(" Java file: " + fileName1)
  case other => println(other + "--NO_MATCH")
}
string2 match {
  case ScalaFileRegEx(fileName2) => println(" Scala file: " + fileName2)
  case other => println(other + "--NO_MATCH")
}

I get this output indicating that the .java file matched but the .scala file did not.

 Java file: Tester
 // Hello.scala--NO_MATCH

How is it that the Java file matched but the .scala file did not?

Upvotes: 1

Views: 92

Answers (1)

rock321987

Reputation: 11032

NOTE

[] denotes a character class. It matches exactly one character out of the listed set.

[^] denotes a negated character class. It matches any single character except the ones listed.
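A minimal sketch of these two points, written in Scala because that is what the question uses. The object name CharClassDemo and the sample inputs are made up for illustration.

```scala
object CharClassDemo {
  // A class matches exactly one character drawn from the set {j, a, v}.
  val oneOf = "[java]".r
  // A negated class matches exactly one character that is NOT in the set.
  val noneOf = "[^java]".r

  // True only if the whole input is matched by the regex.
  def fullMatch(re: scala.util.matching.Regex, s: String): Boolean =
    re.pattern.matcher(s).matches()

  def main(args: Array[String]): Unit = {
    println(fullMatch(oneOf, "java"))  // false: four characters, the class consumes only one
    println(fullMatch(oneOf, "v"))     // true
    println(fullMatch(noneOf, "v"))    // false: v is excluded
    println(fullMatch(noneOf, "T"))    // true
  }
}
```

The point is that [java] is a set of characters, not the word java.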

In your first regex

\S*\s+//\s{1}([^\.java]+)\.java

\S* matches the empty string, because the input starts with a space

\s+ matches that leading space

// matches // literally

\s{1} matches the next space

You are using [^\.java], which means: match any character except ., j, a, v, or a. Since duplicates in a class make no difference, this can be written as [^.jav].

So the remaining string to be tested is

Tester.java

(Un)luckily, no character in Tester is ., j, a, or v, so the class consumes Tester and stops at the dot. \.java then matches the rest, and the whole regex succeeds.

In your second regex

\S*\s+//\s{1}([^\.scala]+)\.scala

\S* matches the empty string, because the input starts with a space

\s+ matches that leading space

// matches // literally

\s{1} matches the next space

Now you are using [^\.scala], which means: match any character except ., s, c, a, l, or a. This can be written as [^.scla].

You now have

Hello.scala

but (un)luckily Hello contains l, which the character class excludes. The class can consume only He, the following \. then finds l instead of a dot, and the regex fails.
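To see where each class stops, here is a small sketch using findPrefixOf from scala.util.matching.Regex. The object name PrefixDemo and the helper prefix are hypothetical; the two classes are copied from the question.

```scala
import scala.util.matching.Regex

object PrefixDemo {
  val javaClass: Regex  = """[^\.java]+""".r
  val scalaClass: Regex = """[^\.scala]+""".r

  // Longest run of allowed characters at the start of the input, if any.
  def prefix(re: Regex, s: String): Option[String] = re.findPrefixOf(s)

  def main(args: Array[String]): Unit = {
    println(prefix(javaClass, "Tester.java"))   // Some(Tester): stops at the dot
    println(prefix(scalaClass, "Hello.scala"))  // Some(He): stops at the first l
  }
}
```

In the second case the class gives up before the dot, so \.scala has nothing to match against.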

How to correct it?

I will change only a small part of your regex:

\S*\s+//\s{1}([^.]*)\.java
              <-->
   This says: match anything except .
   You can also use \w* here instead of [^.]*


\S*\s+//\s{1}([^.]*)\.scala


The {1} in \s{1} is unnecessary. A bare \s already matches exactly one whitespace character, so you can simply write:

\S*\s+//\s([^.]*)\.java
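As a sketch of how the corrected patterns could be plugged back into the extractors from the question (the object name FixedExtractors and the helper describe are made up for illustration, and the patterns are written on one line instead of using replaceAll):

```scala
object FixedExtractors {
  // [^.]* captures everything up to the first dot.
  val JavaFileRegEx  = """\S*\s+//\s([^.]*)\.java""".r
  val ScalaFileRegEx = """\S*\s+//\s([^.]*)\.scala""".r

  // A Regex used as an extractor must match the whole input string.
  def describe(line: String): String = line match {
    case JavaFileRegEx(name)  => "Java file: " + name
    case ScalaFileRegEx(name) => "Scala file: " + name
    case other                => other + "--NO_MATCH"
  }

  def main(args: Array[String]): Unit = {
    println(describe(" // Tester.java"))  // Java file: Tester
    println(describe(" // Hello.scala"))  // Scala file: Hello
    println(describe(" // notes.txt"))    //  // notes.txt--NO_MATCH
  }
}
```

Note that [^.]* also accepts an empty file name such as " // .java". If that should be rejected, [^.]+ is one option.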

Upvotes: 1
