Regex - Not matching NewLine when it should?

Question

I promise, this is the last of my Regex questions for a while... Really.

I'm somehow getting newlines into some matches when I shouldn't, and I'm sure that it's something I'm misinterpreting, OR, the data I'm getting isn't what I expect. (Which IS possible..!)

I have a regex defined: new Regex(@"^\s*[0-9]{4}[A-Z]{2}[\s\*]\s*(?<token>.*?)\-(?<value>.*?)$", RegexOptions.Compiled | RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

My document/string I get is formatted with the occasional line like:

0000AA Token1     - Value
0000AA Token2     - Value
0000AA Token3     - Value
0000AA Another Tok- Value

When I get all the tokens in order like this, the above regex works great. I get four matches:

Match#  Token        Value
1       Token1       Value
2       Token2       Value
3       Token3       Value
4       Another Tok  Value

This is good. However, sometimes the user will send me a file where the tokens have the occasional missing line, as in:

0000AA Token1     - Value
0000AA Token2     - Value
0000AA Token3     - Value
0000AA
0000AA Another Tok- Value

When this happens, my regex will give me the following values:

Match#  Token               Value
1       Token1              Value
2       Token2              Value
3       Token3              Value
4       0000AA Another Tok  Value

I know why: it's matching #4's token starting from the line above it. However, when I change the 'token' group to (?<token>[^\n]*?), I still get the same value in 'token'.

I feel like I'm missing something obvious, because if . was matching newlines when it shouldn't, more folks than I would be raising a ruckus over it. I have checked the incoming string and the newline characters are what I expect, but I'm wondering if something else could be the problem.

Cheers again - Mike.

damix911 · Accepted Answer

The problem is in the \s after the alphanumeric code at the beginning; \s also matches a newline, and you don't want it to. You basically need to match \s AND NOT \n. This is not directly expressible in a regular expression, but if you apply De Morgan's theorem, you can rewrite the expression:

\s AND NOT \n = NOT(NOT \s OR \n)

It turns out that NOT \s can be written as \S:

\s AND NOT \n = NOT(NOT \s OR \n) = NOT(\S OR \n)

This is easily expressible as a regular expression:

\s AND NOT \n = NOT(NOT \s OR \n) = NOT(\S OR \n) = [^\S\n]

Hence, instead of \s use [^\S\n], which matches any character that is neither a non-whitespace character nor a newline; in other words, any whitespace except a newline.
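A quick way to convince yourself of this is to test the class against single characters. The snippet below is only a sketch, written as a top-level C# program (C# 9 or later, which is an assumption about your toolchain); the variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// \s on its own matches any whitespace character, including the newline.
bool sMatchesNewline = Regex.IsMatch("\n", @"\A\s\z");

// [^\S\n] reads as "not (non-whitespace or newline)": whitespace other than \n.
// \A and \z anchor to the very start and end of the input string.
Regex wsExceptNewline = new Regex(@"\A[^\S\n]\z");

bool matchesSpace = wsExceptNewline.IsMatch(" ");            // true
bool matchesTab = wsExceptNewline.IsMatch("\t");             // true
bool matchesNewline = wsExceptNewline.IsMatch("\n");         // false
bool matchesLetter = wsExceptNewline.IsMatch("a");           // false
bool matchesCarriageReturn = wsExceptNewline.IsMatch("\r");  // true, \r is still whitespace

Console.WriteLine($"\\s matches \\n:        {sMatchesNewline}");
Console.WriteLine($"class matches space:  {matchesSpace}");
Console.WriteLine($"class matches tab:    {matchesTab}");
Console.WriteLine($"class matches \\n:     {matchesNewline}");
Console.WriteLine($"class matches 'a':    {matchesLetter}");
Console.WriteLine($"class matches \\r:     {matchesCarriageReturn}");
```

Note that the class still accepts \r. If the input turned out to use \r\n line endings, [^\S\r\n] would exclude both characters.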

I made a few other changes in the same area because I felt some parts were unnecessary. You can add them back if you think they are needed.

Regex re = new Regex(@"^[0-9]{4}[A-Z]{2}[^\S\n]*(?<token>.*?)\-(?<value>.*?)$", RegexOptions.Compiled | RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
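To see the difference on the sample data from the question, here is a rough side-by-side run of the original pattern and the revised one. It is a sketch, not a drop-in: it assumes the two groups are named token and value, that the input uses \n line endings, and that the code runs as a top-level C# program (C# 9 or later). The helper Tokens is a hypothetical name introduced for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sample input from the question, including the line that carries only the code.
string input =
    "0000AA Token1     - Value\n" +
    "0000AA Token2     - Value\n" +
    "0000AA Token3     - Value\n" +
    "0000AA\n" +
    "0000AA Another Tok- Value";

// Hypothetical helper: collects the trimmed 'token' group of every match.
List<string> Tokens(string pattern, string text)
{
    var re = new Regex(pattern,
        RegexOptions.Compiled | RegexOptions.Multiline |
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    var tokens = new List<string>();
    foreach (Match m in re.Matches(text))
    {
        // The lazy group keeps the padding before the dash, so trim it here.
        tokens.Add(m.Groups["token"].Value.Trim());
    }
    return tokens;
}

// Original pattern: [\s\*] can consume the newline after the bare "0000AA".
List<string> before = Tokens(
    @"^\s*[0-9]{4}[A-Z]{2}[\s\*]\s*(?<token>.*?)\-(?<value>.*?)$", input);

// Revised pattern: [^\S\n]* cannot cross a line boundary.
List<string> after = Tokens(
    @"^[0-9]{4}[A-Z]{2}[^\S\n]*(?<token>.*?)\-(?<value>.*?)$", input);

Console.WriteLine("before: " + string.Join(" | ", before));
Console.WriteLine("after:  " + string.Join(" | ", after));
```

With the original pattern the fourth token comes back as "0000AA Another Tok"; with the revised one the bare "0000AA" line simply produces no match and the fourth token is "Another Tok".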
