Coyttl

Reputation: 534

Regex - Not matching NewLine when it should?

Promise, last of my Regex questions for a while. ..Really.

I'm somehow getting newlines into some matches when I shouldn't, and I'm sure that it's something I'm misinterpreting, OR, the data I'm getting isn't what I expect. (Which IS possible..!)

I have a regex defined: new Regex(@"^\s*[0-9]{4}[A-Z]{2}[\s\*]\s*(?<token>.*?)\-(?<value>.*?)$", RegexOptions.Compiled | RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

My document/string I get is formatted with the occasional line like:

0000AA Token1     - Value
0000AA Token2     - Value
0000AA Token3     - Value
0000AA Another Tok- Value

When I get all the tokens in order like this, the above regex works great. I get four matches:

Match# <token> <value>
1      Token1      Value
2      Token2      Value
3      Token3      Value
4      Another Tok Value

This is good. However, sometimes the user will send me a file where the tokens have the occasional missing line, as in:

0000AA Token1     - Value
0000AA Token2     - Value
0000AA Token3     - Value
0000AA
0000AA Another Tok- Value

When this happens, my regex will give me the following values:

Match# <token>           <value>
1      Token1             Value
2      Token2             Value
3      Token3             Value
4      0000AA Another Tok Value

I know why, it's matching the #4's token starting with the line above it. However, when I change the 'token' grouping to (?<token>[^\n]*?), I still get the same value in 'token'.

I feel like I'm missing something obvious, because if . was matching newlines when it shouldn't, more folks than I would be raising a ruckus over it. I have checked the incoming string - newlines ARE \n, and not \r\n, but I'm wondering if something else could be the problem.

Cheers again - Mike.

Upvotes: 4

Views: 2255

Answers (1)

damix911

Reputation: 4453

The problem is the \s after the alphanumeric code at the beginning: \s also matches a newline, and you don't want it to. You basically need to match \s AND NOT \n. There is no portable way to write that intersection directly in a character class, but with De Morgan's law you can rewrite the expression:

\s AND NOT \n = NOT(NOT \s OR \n)

It turns out the NOT \s can be written \S:

\s AND NOT \n = NOT(NOT \s OR \n) = NOT(\S OR \n)

This is easily expressible as a regular expression:

\s AND NOT \n = NOT(NOT \s OR \n) = NOT(\S OR \n) = [^\S\n]

Hence, instead of \s use [^\S\n], which means: match any character that is neither non-whitespace nor a newline, i.e. any whitespace except \n.
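To see the difference between the two classes in isolation, here is a minimal sketch. The class name WhitespaceClassDemo and the helper Gap are made up for illustration; only Regex.Match and named groups from System.Text.RegularExpressions are used. It captures whatever the class consumes right after the code prefix on a line that has nothing else on it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceClassDemo
{
    // Hypothetical helper: returns the text that classPattern* consumes
    // immediately after the 4-digit, 2-letter code at the start of input.
    public static string Gap(string classPattern, string input)
    {
        Match m = Regex.Match(input, "^[0-9]{4}[A-Z]{2}(?<gap>" + classPattern + "*)");
        return m.Groups["gap"].Value;
    }

    public static void Main()
    {
        // A code-only line followed by a normal line, as in the question.
        string input = "0000AA\n0000AA Another Tok- Value";

        // \s* swallows the line break, so the match runs on into the next line.
        Console.WriteLine(Gap(@"\s", input).Length);       // prints 1

        // [^\S\n]* stops at the line break.
        Console.WriteLine(Gap(@"[^\S\n]", input).Length);  // prints 0
    }
}
```

Spaces and tabs on the same line are still consumed by both classes; only the newline is treated differently.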

I made a few other changes in the same area because some parts didn't seem necessary: I dropped the leading \s* and the [\s\*] class. You can add them back if you need them, but use [^\S\n] in place of \s there too.

Regex re = new Regex(@"^[0-9]{4}[A-Z]{2}[^\S\n]*(?<token>.*?)\-(?<value>.*?)$", RegexOptions.Compiled | RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
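As a quick check, the sketch below runs that pattern over the sample from the question that has the code-only line in it. The class name TokenLineDemo and the Parse method are hypothetical, and trimming the captured groups is my own addition (the lazy groups otherwise keep the padding around the dash). RegexOptions.Compiled is left out since it does not change what matches.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenLineDemo
{
    // Same pattern as above; [^\S\n]* cannot cross a line break.
    private static readonly Regex Line = new Regex(
        @"^[0-9]{4}[A-Z]{2}[^\S\n]*(?<token>.*?)\-(?<value>.*?)$",
        RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Hypothetical helper: returns (token, value) pairs, trimmed of padding.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Line.Matches(text))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["token"].Value.Trim(),
                m.Groups["value"].Value.Trim()));
        }
        return result;
    }

    public static void Main()
    {
        // Sample input from the question, with \n line endings.
        string text =
            "0000AA Token1     - Value\n" +
            "0000AA Token2     - Value\n" +
            "0000AA Token3     - Value\n" +
            "0000AA\n" +
            "0000AA Another Tok- Value";

        foreach (var pair in Parse(text))
        {
            Console.WriteLine(pair.Key + " => " + pair.Value);
        }
    }
}
```

The code-only line produces no match at all, so the fourth token comes out as "Another Tok" rather than "0000AA Another Tok".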

Upvotes: 4
