dafie

Reputation: 1169

Ignore whitespace in regex

To parse this fragment:

Number: 1235, Title: "Today is a good day"

I am using this regex:

^Number: (\d+?), Title: \"(.*?)\"$

Unfortunately now I have to deal with corrupted data, like:

Nu mber: 1235, Title: "Today is a good day"
Numb er: 1235, Title: "Today is a bad day"
Nu mbe r: 1235, Title: "Foo"
Number: 1235, T itle: "Bar"
Nu mber: 1235, Tit le: "Example yyy"
Number: 1235, Title: "One"

I need to ignore the stray whitespace inside the words Number and Title. I can't simply strip all whitespace from both the regex and the input, because the spaces in the text after Title must be preserved.

This solution seems to work:

^\s*N\s*u\s*m\s*b\s*e\s*r\s*:\s*(\d+?)\s*,\s*T\s*i\s*t\s*l\s*e\s*:\s*\"(.*?)\"\s*$

But it is really unreadable. Any ideas?

I should also mention that I don't want to match something like this:

Age: 99, Description: "Hi"

Upvotes: 0

Views: 117

Answers (1)

Adam Katz

Reputation: 16246

You don't need lazy (ungreedy) quantifiers there: the pattern is anchored at both ends, so the longest and shortest matches are the same. It'll be (very slightly) faster to just use ^Number: (\d+), Title: \"(.*)\"$ in your example.
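
To illustrate the point about anchoring, here is a minimal C# sketch that runs both variants on the same line. The class name QuantifierDemo and the helper Parse are hypothetical, and the sample title with inner quotes is made up for the demonstration.

```csharp
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Lazy pattern from the question and the greedy variant suggested above.
    // In a verbatim string (@"..."), a literal double quote is written as "".
    public const string Lazy = @"^Number: (\d+?), Title: ""(.*?)""$";
    public const string Greedy = @"^Number: (\d+), Title: ""(.*)""$";

    // Hypothetical helper: returns the captured (number, title) pair,
    // or null when the input does not match the pattern.
    public static (string Number, string Title)? Parse(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success)
        {
            return null;
        }
        return (m.Groups[1].Value, m.Groups[2].Value);
    }
}
```

Because ^ and $ pin both ends of the match, the lazy version is forced to keep expanding until the final quote, so both patterns should capture the same number and title even when the title itself contains quotes.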

If you know the labels will always be just Number and Title, you can match only their first letters and skip ahead to the colon:

^N[^:]+:\s+(\d+),\s+T[^:]+:\s+\"(.*)\"$
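
A short C# sketch of how that pattern behaves, assuming a hypothetical wrapper class TolerantPattern. It accepts the corrupted lines from the question and rejects the Age/Description line, but it also accepts any other pair of labels that start with N and T, which is the assumption being made.

```csharp
using System.Text.RegularExpressions;

public static class TolerantPattern
{
    // Label-agnostic pattern: a label starting with N, then a label starting with T.
    // In a verbatim string (@"..."), a literal double quote is written as "".
    public static readonly Regex Loose =
        new Regex(@"^N[^:]+:\s+(\d+),\s+T[^:]+:\s+""(.*)""$");

    // True when the line fits the loose shape; the labels themselves are not verified.
    public static bool Matches(string line)
    {
        return Loose.IsMatch(line);
    }
}
```

For example, TolerantPattern.Matches("Nu mbe r: 1235, Title: \"Foo\"") is true, but so is a made-up line such as "Name: 7, Type: \"x\"".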

If you can't make that assumption, use some C# code to copy the data into a temporary variable, strip the whitespace from that copy, and check it against the strict template first.

I don't know C#, so this sample code is likely buggy, but it should still convey my thinking:

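using System.Text.RegularExpressions;

string input = "Nu mber: 1235, Title: \"Today is a good day\"";
// Step 1: remove the spaces from a copy and verify the labels and overall shape.
// In a verbatim string (@"..."), a literal double quote is written as "".
if (Regex.IsMatch(input.Replace(" ", ""), @"^Number:\d+,Title:"".*""$")) {
  // Step 2: match the original input so the spaces in the title are preserved.
  Match match = Regex.Match(input, @"^N[^:]+:\s+(\d+),\s+T[^:]+:\s+""(.*)""$");
  if (match.Success) {
    string number = match.Groups[1].Value;  // "1235"
    string title = match.Groups[2].Value;   // "Today is a good day"
    // do stuff with number and title
  }
}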

This first checks whether a copy of the input with its spaces removed matches the strict template. We can't extract the fields from that copy because we need the spaces in the Title, but it at least verifies the labels and the formatting. Then it runs the space-tolerant regex on the original input, saving the two desired fields.
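
A single-pass variant of the same idea, offered as a sketch and not as the method described above: capture the two labels, strip the whitespace from the labels only, and compare them to the expected words. The class LabelCheck, its pattern, and the helper Squash are all hypothetical names introduced for illustration.

```csharp
using System.Text.RegularExpressions;

public static class LabelCheck
{
    // Hypothetical pattern: groups 1 and 3 capture the (possibly corrupted) labels,
    // group 2 the number, group 4 the title with its spaces intact.
    private static readonly Regex Line =
        new Regex(@"^([^:]+):\s*(\d+)\s*,([^:]+):\s*""(.*)""$");

    // Removes every whitespace character from a label, e.g. "Nu mbe r" becomes "Number".
    private static string Squash(string label)
    {
        return Regex.Replace(label, @"\s+", "");
    }

    // Returns true and fills number/title only when both labels
    // read Number and Title once their whitespace is removed.
    public static bool TryParse(string line, out string number, out string title)
    {
        number = string.Empty;
        title = string.Empty;
        Match m = Line.Match(line);
        if (!m.Success) return false;
        if (Squash(m.Groups[1].Value) != "Number") return false;
        if (Squash(m.Groups[3].Value) != "Title") return false;
        number = m.Groups[2].Value;
        title = m.Groups[4].Value;
        return true;
    }
}
```

The trade-off is that the labels are verified exactly while the title is never touched, at the cost of a little code around the regex.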

Upvotes: 1
