mrbrahman

Reputation: 527

Scala - String matches RegEx

This is on Scala 2.11.8

I'm trying to read and parse a text file in Scala, and I'm seeing behavior I didn't expect from String.matches.

Say I have a file.txt with the contents below:

#############
# HEADING 1
#############

- The zeroth line item, if there can be one
- First Line item
- Second Line item
- Here is the third
    and this one has some details
- A fourth one followed by empty line

- Fifth line item

I read the file and parse the contents like this:

val source = scala.io.Source.fromFile("file.txt")
val lines = try source.getLines.filterNot(_.matches("#.*")).mkString("\n") finally source.close
val items = lines.split("""(\n-|^-)\s""").filter(_.nonEmpty)

Now, looking at the individual items and their values:

// print the first few items
scala> items(0)
res0: String = The zeroth line item, if there can be one

scala> items(1)
res1: String = First Line item

scala> items(3)
res2: String =
Here is the third
    and this one has some details

scala> items(4)
res3: String =
"A fourth one followed by empty line
"

scala> items(5)
res4: String =
"Fifth line item

"

Now for some matching

// Matching the items with RegEx
scala> items(0).matches("The.*")
res5: Boolean = true

scala> items(1).matches("First.*")
res6: Boolean = true

scala> items(3).matches("Here is.*")
res7: Boolean = false                    // ??

scala> items(4).matches("A fourth.*")
res8: Boolean = false                    // ??


// But startsWith seems to recognize it just fine!
scala> items(3).startsWith("Here is")
res9: Boolean = true

scala> items(4).startsWith("A fourth")
res10: Boolean = true

// Even this doesn't match
scala> items(4).matches(".*A fourth.*")
res11: Boolean = false                    // ?

My observation is that this happens only when the item is not a single line, i.e. when it spans multiple lines (including when it is followed by an empty line).

Is this behavior expected? How can I match consistently using a regex?

Upvotes: 0

Views: 357

Answers (1)

Andrey Tyukin

Reputation: 44967

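Yes, this is expected. By default, the dot does not match line terminators, and String.matches has to match the entire string, so a pattern such as "Here is.*" fails as soon as the item contains a newline. Consider activating DOTALL mode with the (?s) flag at the beginning of the regex. Example: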

val text = 
  """|- The zeroth line item, if there can be one
     |- First Line item
     |- Second Line item
     |- Here is the third
     |    and this one has some details
     |- A fourth one followed by empty line
     |
     |- Fifth line item
     |
     |""".stripMargin


val items = text.split("""(\n-|^-)\s""").filter(_.nonEmpty)

def describeMatch(str: String, regex: String): Unit = {
  println("-" * 60)
  println("The string\n>>>%s<<<\n%s".format(
    str,
    (if (str.matches(regex)) "Matches" else "Doesn't match") + s" >>>$regex<<<"
  ))
}

describeMatch(items(0), "The.*")
describeMatch(items(1), "First.*")
describeMatch(items(3), "Here is.*")
describeMatch(items(3), "(?s)Here is.*")
describeMatch(items(4), "A fourth.*")
describeMatch(items(4), "(?s)A fourth.*")
describeMatch(items(4), ".*A fourth.*$")
describeMatch(items(4), "(?s)^A fourth.*$")

The output should speak for itself:

------------------------------------------------------------
The string
>>>The zeroth line item, if there can be one<<<
Matches >>>The.*<<<
------------------------------------------------------------
The string
>>>First Line item<<<
Matches >>>First.*<<<
------------------------------------------------------------
The string
>>>Here is the third
    and this one has some details<<<
Doesn't match >>>Here is.*<<<
------------------------------------------------------------
The string
>>>Here is the third
    and this one has some details<<<
Matches >>>(?s)Here is.*<<<
------------------------------------------------------------
The string
>>>A fourth one followed by empty line
<<<
Doesn't match >>>A fourth.*<<<
------------------------------------------------------------
The string
>>>A fourth one followed by empty line
<<<
Matches >>>(?s)A fourth.*<<<
------------------------------------------------------------
The string
>>>A fourth one followed by empty line
<<<
Doesn't match >>>.*A fourth.*$<<<
------------------------------------------------------------
The string
>>>A fourth one followed by empty line
<<<
Matches >>>(?s)^A fourth.*$<<<
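
If an inline flag is not wanted, the same behavior can be reached in a few other ways. The following is only a sketch, not the one right way to do it: the object name MultilineMatchSketch and the val names are made up for illustration, and the item string is a hand-written copy of items(3) from the example above. It uses only java.util.regex.Pattern and Scala's built-in .r method on strings.

```scala
import java.util.regex.Pattern

// Hypothetical container object; all names here are illustrative only.
object MultilineMatchSketch {
  // Hand-written copy of a multi-line item such as items(3) above.
  val item = "Here is the third\n    and this one has some details"

  // String.matches must consume the whole input, and without DOTALL
  // the dot stops at the first line terminator, so this is false.
  val plain: Boolean = item.matches("Here is.*")

  // Inline (?s) flag, as in the example above.
  val inlineFlag: Boolean = item.matches("(?s)Here is.*")

  // The same flag passed as a compile option instead of inline.
  val compiled: Boolean =
    Pattern.compile("Here is.*", Pattern.DOTALL).matcher(item).matches()

  // Prefix-only check: lookingAt is anchored at the start of the input
  // but does not have to reach the end of it.
  val prefixOnly: Boolean =
    Pattern.compile("Here is.*").matcher(item).lookingAt()

  // Unanchored search with a Scala Regex; the dot still stops at the newline.
  val found: Option[String] = "third.*".r.findFirstIn(item)
}
```

Which variant fits depends on the intent: DOTALL if the pattern should describe the whole item, lookingAt or findFirstIn if only a prefix or a fragment matters.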

Upvotes: 1
