Reputation: 534
I promise, this is the last of my regex questions for a while. Really.
I'm somehow getting newlines into some matches when I shouldn't, and I'm sure that it's something I'm misinterpreting, OR, the data I'm getting isn't what I expect. (Which IS possible..!)
I have a regex defined:
new Regex(@"^\s*[0-9]{4}[A-Z]{2}[\s\*]\s*(?<token>.*?)\-(?<value>.*?)$", RegexOptions.Compiled | RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
My document/string I get is formatted with the occasional line like:
0000AA Token1 - Value
0000AA Token2 - Value
0000AA Token3 - Value
0000AA Another Tok- Value
When I get all the tokens in order like this, the above regex works great. I get four matches:
Match# <token> <value>
1 Token1 Value
2 Token2 Value
3 Token3 Value
4 Another Tok Value
This is good. However, sometimes the user will send me a file where the tokens have the occasional missing line, as in:
0000AA Token1 - Value
0000AA Token2 - Value
0000AA Token3 - Value
0000AA
0000AA Another Tok- Value
When this happens, my regex will give me the following values:
Match# <token> <value>
1 Token1 Value
2 Token2 Value
3 Token3 Value
4 0000AA Another Tok Value
I know why: match #4's token starts on the line above it. However, when I change the 'token' group to (?<token>[^\n]*?), I still get the same value in 'token'.
I feel like I'm missing something obvious, because if . were matching newlines when it shouldn't, more folks than I would be raising a ruckus over it. I have checked the incoming string: newlines ARE \n, and not \r\n, but I'm wondering if something else could be the problem.
Cheers again - Mike.
Upvotes: 4
Views: 2255
Reputation: 4453
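The problem is the \s after the alphanumeric code at the beginning: \s also matches a newline, and you don't want it to. That is how the match for the short "0000AA" line runs on into the next line before the token group even starts, which is also why changing the token group to [^\n]*? made no difference. You basically need to match \s AND NOT \n. There is no shorthand escape for that, but with De Morgan's theorem you can rewrite the expression: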
\s AND NOT \n = NOT(NOT \s OR \n)
It turns out that NOT \s can be written as \S:
\s AND NOT \n = NOT(NOT \s OR \n) = NOT(\S OR \n)
This is easily expressible as a regular expression:
\s AND NOT \n = NOT(NOT \s OR \n) = NOT(\S OR \n) = [^\S\n]
Hence, instead of \s use [^\S\n], which matches any character that is neither a non-whitespace character nor a newline; in other words, any whitespace except a newline.
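To check that claim in isolation, here is a minimal sketch that tests single characters against the class. The class name HorizontalWhitespaceDemo and the helper IsMatch are made up for this illustration; only Regex and its IsMatch method come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HorizontalWhitespaceDemo
{
    // [^\S\n]: a character that is neither non-whitespace (\S) nor '\n',
    // i.e. whitespace other than newline. \A and \z anchor the whole string.
    private static readonly Regex WhitespaceExceptNewline = new Regex(@"\A[^\S\n]\z");

    // Hypothetical helper for this demo: tests one character against the class.
    public static bool IsMatch(char c)
    {
        return WhitespaceExceptNewline.IsMatch(c.ToString());
    }

    public static void Main()
    {
        Console.WriteLine(IsMatch(' '));   // True: space is whitespace
        Console.WriteLine(IsMatch('\t'));  // True: tab is whitespace
        Console.WriteLine(IsMatch('\n'));  // False: newline is excluded explicitly
        Console.WriteLine(IsMatch('a'));   // False: 'a' is matched by \S
    }
}
```

Note that this class still matches \r. That is fine for input that uses bare \n line endings, as stated in the question; for \r\n input you would probably want [^\S\r\n] instead.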
I made a few other changes in the same area because some parts seemed unnecessary. You can add them back if you think they are needed.
Regex re = new Regex(@"^[0-9]{4}[A-Z]{2}[^\S\n]*(?<token>.*?)\-(?<value>.*?)$", RegexOptions.Compiled | RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
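As a rough check, the sketch below runs that pattern over the sample from the question, including the short "0000AA" line. The class name TokenLineDemo and the Parse helper are made up for this example, and the Trim() calls are an assumption on my part: the raw captures keep the spaces around the dash, while the tables in the question show them without.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenLineDemo
{
    // Same pattern and options as above: [^\S\n]* cannot consume the line break.
    private static readonly Regex Re = new Regex(
        @"^[0-9]{4}[A-Z]{2}[^\S\n]*(?<token>.*?)\-(?<value>.*?)$",
        RegexOptions.Compiled | RegexOptions.Multiline |
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Hypothetical helper: returns the trimmed (token, value) pair of every match.
    public static List<(string Token, string Value)> Parse(string input)
    {
        var results = new List<(string Token, string Value)>();
        foreach (Match m in Re.Matches(input))
        {
            // Trim() is an assumption: the captures include the padding spaces.
            results.Add((m.Groups["token"].Value.Trim(),
                         m.Groups["value"].Value.Trim()));
        }
        return results;
    }

    public static void Main()
    {
        // Sample from the question, with the line that has no token or value.
        string input =
            "0000AA Token1 - Value\n" +
            "0000AA Token2 - Value\n" +
            "0000AA Token3 - Value\n" +
            "0000AA\n" +
            "0000AA Another Tok- Value";

        foreach (var (token, value) in Parse(input))
        {
            Console.WriteLine($"{token} | {value}");
        }
    }
}
```

The short line produces no match at all, because the pattern still requires a dash on the same line, so the fourth match should come back with the token "Another Tok" rather than "0000AA Another Tok".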
Upvotes: 4