flowit

Reputation: 1442

Scala regex match lines with special characters

I have a code segment that reads lines from a file, and I want to filter some of them out. Specifically, I want to discard every line that does not consist of exactly three tab-separated columns, where the first column is a number and the other two may contain any character except tab and newline (DOS and Unix).

I already checked my regex on http://www.regexr.com/, and it works there.

scala> val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0@\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
scala> val myreg = "^[0-9]+(\t[^\t\r\n]+){2}(\n|\r\n)$"

scala> mystr.matches(myreg)
res2: Boolean = false

What I found out is that the problem is related to special characters. A simple example:

scala> val tabstr = """123456\t123456"""
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res3: Boolean = false

scala> val tabstr = "123456\t123456"
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res4: Boolean = true

It seems I mustn't use a raw string for my line (see mystr in the first code block). But if I don't use a raw string, Scala complains:

error: invalid escape character

So how can I deal with this messy input and still use my regex to filter out some lines?

Upvotes: 1

Views: 2581

Answers (1)

Wiktor Stribiżew

Reputation: 627468

You are using raw string literals. Inside a raw string literal, \ does not introduce an escape sequence such as tab (\t) or newline (\n); the \n in a raw string literal is just two characters, a backslash followed by n.
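The difference is easy to see by comparing lengths. This is a minimal sketch; the object and value names are made up for illustration, and the two strings are the tabstr values from the question.

```scala
object RawVsEscaped {
  // Triple-quoted (raw) literal: the backslash and the 't' stay two separate characters.
  val raw: String = """123456\t123456"""

  // Regular literal: the compiler turns \t into a single tab character.
  val escaped: String = "123456\t123456"

  def main(args: Array[String]): Unit = {
    println(raw.length)                // 14 = 6 digits + '\' + 't' + 6 digits
    println(escaped.length)            // 13 = 6 digits + tab + 6 digits
    println(raw.exists(_ == '\t'))     // false: no real tab in the raw string
    println(escaped.exists(_ == '\t')) // true
  }
}
```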

In a regex, to match a literal \, you need 2 backslashes when the pattern is written as a raw string literal, and 4 backslashes when it is written as a regular string literal.
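As a small illustration of the backslash counts (a sketch with a made-up three-character input, not taken from the question), both spellings below produce the same pattern:

```scala
object BackslashCount {
  // Hypothetical input: three characters, 'a', a backslash, and 'b'.
  val input: String = """a\b"""

  // Raw-string pattern: two backslashes reach the regex engine,
  // which reads them as one literal backslash.
  val rawPattern: String = """a\\b"""

  // Regular-string pattern: four backslashes in the source become
  // two after the compiler processes the escapes.
  val escapedPattern: String = "a\\\\b"

  def main(args: Array[String]): Unit = {
    println(rawPattern == escapedPattern)  // true: the same pattern text
    println(input.matches(rawPattern))     // true
    println(input.matches(escapedPattern)) // true
  }
}
```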

So, to match all your inputs, you need to use the following regexps:

val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0@\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
val myreg = """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""
println(mystr.matches(myreg)) // => true
val tabstr = """123456\t123456"""
println(tabstr.matches("""[0-9]+\\t[0-9]+""")) // => true
val tabstr2 = "123456\t123456"
println(tabstr2.matches("""^[0-9]+(?:\\t|\t)[0-9]+$""")) // => true

Non-capturing groups are not essential here: you only need to check whether a string matches, so capturing groups would work just as well. For the same reason you do not even need ^ and $, since matches requires the whole input string to match. If you later need to extract matches or capturing groups, non-capturing groups will give you a "cleaner" output structure, that is all.
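A short sketch of both points, using a hypothetical sample line with a real tab; the object, value, and function names are assumptions made for this example only:

```scala
object GroupsDemo {
  // Hypothetical sample line with a real tab character (regular string literal).
  val line: String = "123456\tabc"

  // The same pattern twice; only the group around the tab differs.
  val capturing = """([0-9]+)(\t)([a-z]+)""".r
  val nonCapturing = """([0-9]+)(?:\t)([a-z]+)""".r

  // Extract the two interesting columns. Pattern matching on a Regex
  // requires the whole string to match, just like String.matches.
  def columns(s: String): Option[(String, String)] = s match {
    case nonCapturing(num, word) => Some((num, word))
    case _                       => None
  }

  def main(args: Array[String]): Unit = {
    // matches is implicitly anchored, so leading junk fails even without ^ and $.
    println(("x" + line).matches("""[0-9]+\t[a-z]+"""))
    // The capturing variant reports three groups (the tab is one of them),
    // the non-capturing variant only two.
    println(capturing.findFirstMatchIn(line).map(_.groupCount))
    println(nonCapturing.findFirstMatchIn(line).map(_.groupCount))
    println(columns(line))
  }
}
```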

The last two regexps are simple: (?:\\t|\t) matches either a literal backslash followed by t, or a tab character. \t on its own just matches a tab.

The first one uses a tempered greedy token. This is a simplified regex; a better one can be written with the unroll-the-loop technique: [0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n)

  • [0-9]+ - 1 or more digits
  • (?:\\t(?:(?!\\[trn]).)*){2} - tempered greedy token: 2 occurrences of the literal two-character string \t, each followed by zero or more characters (other than a newline) that do not start one of the two-character combinations \t, \r or \n.
  • (?:\\r)? - 1 or 0 occurrences of a literal combination of \ and r
  • (?:\\n) - one occurrence of a literal combination of \ and n.
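To check the breakdown above, the tempered and the unrolled patterns can be run against a few raw-string lines. The two patterns are copied from this answer; the object name and the shortened sample lines are hypothetical and exist only for this sketch.

```scala
object TemperedVsUnrolled {
  // Both patterns are copied verbatim from the answer.
  val tempered: String = """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""
  val unrolled: String = """[0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n)"""

  // Hypothetical raw-string samples: every \t, \x.. and \n below is a
  // backslash followed by ordinary characters, not a control character.
  val threeColumns: String = """123456\thttp://some.url/path\t\x03U\x1D?\n"""
  val twoColumns: String   = """123456\thttp://some.url/path\n"""
  val fourColumns: String  = """1\ta\tb\tc\n"""
  val noNumber: String     = """abc\tx\ty\n"""

  def main(args: Array[String]): Unit = {
    for (s <- Seq(threeColumns, twoColumns, fourColumns, noNumber)) {
      // Print the verdict of each pattern next to the sample.
      println(s"${s.matches(tempered)} ${s.matches(unrolled)} $s")
    }
  }
}
```

On these samples only threeColumns should be accepted, and the two patterns should agree on every line.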

Upvotes: 5
