Reputation: 1442
I have a code segment that reads lines from a file, and I want to filter certain lines out. Specifically, I want to discard every line that does not have exactly three tab-separated columns, where the first column is a number and the other two columns can contain any character except tab and newline (DOS and Unix).
I already checked my regex on http://www.regexr.com/ and there it works.
scala> val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0@\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
scala> val myreg = "^[0-9]+(\t[^\t\r\n]+){2}(\n|\r\n)$"
scala> mystr.matches(myreg)
res2: Boolean = false
What I found out is that the problem is related to special characters. Here is a simpler example:
scala> val tabstr = """123456\t123456"""
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res3: Boolean = false
scala> val tabstr = "123456\t123456"
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res4: Boolean = true
It seems I must not use a raw string for my line (see mystr in the first code block). But if I don't use a raw string, Scala complains:
error: invalid escape character
So how can I deal with this messy input and still use my regex to filter out some lines?
Upvotes: 1
Views: 2581
Reputation: 627468
You are using raw string literals. Inside a raw string literal, \ does not introduce escape sequences such as tab (\t) or newline (\n): the \n in a raw string literal is just two characters, a backslash followed by n.
In a regex, to match a literal \, you need 2 backslashes when the regex is written as a raw string literal, and 4 backslashes when it is written as a regular string literal.
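A small sketch of the backslash counting, assuming a made-up input a\tb (backslash plus t, not a tab). Both pattern literals below denote the same regex text; only the amount of escaping in the source code differs.

```scala
object BackslashCountDemo {
  // Input containing one literal backslash followed by 't' (two characters).
  val input = """a\tb"""

  // The regex engine must receive two backslashes to match one literal backslash.
  val rawPattern = """a\\tb"""    // raw literal: 2 backslashes in the source
  val escapedPattern = "a\\\\tb"  // regular literal: 4 backslashes in the source

  def main(args: Array[String]): Unit = {
    println(rawPattern == escapedPattern)  // true: same pattern text
    println(input.matches(rawPattern))     // true
    println(input.matches("""a\tb"""))     // false: \t means a tab to the regex engine
  }
}
```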
So, to match all your inputs, you need to use the following regexps:
val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0@\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
val myreg = """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""
println(mystr.matches(myreg)) // => true
val tabstr = """123456\t123456"""
println(tabstr.matches("""[0-9]+\\t[0-9]+""")) // => true
val tabstr2 = "123456\t123456"
println(tabstr2.matches("""^[0-9]+(?:\\t|\t)[0-9]+$""")) // => true
Non-capturing groups are not important here, since you only need to check whether a string matches; you could use capturing groups just as well. (This also means you do not even need ^ and $, since matches requires the whole input string to match.) If you later need to extract matches or capturing groups, non-capturing groups will give you a "cleaner" output structure, that is all.
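As a rough illustration of both points, the sketch below checks that matches already requires the whole input to match, and then uses capturing groups to pull out the two columns. The names Columns and columns are hypothetical helpers, not part of the question's code.

```scala
object WholeMatchDemo {
  // Raw literal: contains a backslash followed by 't', not a tab.
  val line = """123456\t123456"""

  // Capturing groups around both numeric columns; no ^ or $ needed.
  val Columns = """([0-9]+)\\t([0-9]+)""".r

  // Returns both columns if the whole string matches, otherwise None.
  def columns(s: String): Option[(String, String)] = s match {
    case Columns(first, second) => Some((first, second))
    case _                      => None
  }

  def main(args: Array[String]): Unit = {
    println(line.matches("""[0-9]+\\t[0-9]+"""))    // true
    println(line.matches("""^[0-9]+\\t[0-9]+$"""))  // true, the anchors add nothing
    println(line.matches("""[0-9]+"""))             // false: a partial match is not enough
    println(columns(line))                          // Some((123456,123456))
  }
}
```

Pattern matching on a scala.util.matching.Regex also requires the whole string to match, so the extractor behaves like matches here.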
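The last two regexps are simple. In the second one, \\t matches a literal \ followed by t. In the third one, (?:\\t|\t) matches either a \ followed by t or a tab character; \t on its own just matches a tab.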
The first one uses a tempered greedy token. This is a simplified regex; a better one can be written with the unroll-the-loop method: [0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n)
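The following sketch runs the tempered greedy token regex and the unrolled regex side by side. The sample lines are made up for illustration and are all raw literals, so every backslash sequence in them is literal text.

```scala
object UnrolledLoopDemo {
  val tempered = """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""
  val unrolled = """[0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n)"""

  // Hypothetical sample lines (assumptions, not taken from the real input file).
  val samples = Seq(
    """1\tfoo\tbar\n""",        // three columns, Unix ending
    """1\tfoo\x03\tbar\r\n""",  // other backslash sequences allowed, DOS ending
    """1\tfoo\n""",             // only two columns
    """1\ta\tb\tc\n""",         // four columns
    """x\tfoo\tbar\n"""         // first column is not a number
  )

  def main(args: Array[String]): Unit = {
    // Both lines print: List(true, true, false, false, false)
    println(samples.map(_.matches(tempered)))
    println(samples.map(_.matches(unrolled)))
  }
}
```

The two patterns agree on these samples. One difference to keep in mind: . does not match a real line break by default, while [^\\] does, so the patterns can differ on input that contains actual newline characters.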
The tempered greedy token regex breaks down as follows:
[0-9]+ - 1 or more digits
(?:\\t(?:(?!\\[trn]).)*){2} - tempered greedy token: 2 occurrences of a literal string \t, each followed by any characters other than a newline, stopping before the 2-symbol combinations \t, \r or \n
(?:\\r)? - 1 or 0 occurrences of a literal \r
(?:\\n) - one occurrence of a literal combination of \ and n
Upvotes: 5