Chris Holmes

Reputation: 11574

What Regex to capture Multiline Text Between Two Phrases?

I need to capture form data text from an email form by capturing what exists between elements.

The text I get in the body of the email is multiline with a lot of whitespace between keywords. I don't care about the whitespace; I'll trim it out, but I have to be able to capture what occurs between two form field descriptors.

The key phrases are really clear and unique, but I can't get the Regex to work:

Sample data:

Loan Number:

    123456789


Address:

    101 Main Street
My City, WA
99101


Servicemember Name:

    Joe Smith


Servicemember Phone Number:

    423-283-5000


Complaint Description:

    He has a complaint


Associate Information


Associate Name:

    Some Dude


Phone Login:

    654312


Complaint Date:

    1/10/2012

Regex (to capture the loan number, for example):

^Loan Number:(.*?)Address:.$

What am I missing?

EDIT: Also, in addition to capturing data between the various form labels, I need to capture the data between the last label and the end of the file. After reading the responses here, I've been able to capture the data between form labels, but not the last piece of data, the Complaint Date.

Upvotes: 2

Views: 4373

Answers (3)

Jason McCreary

Reputation: 72981

What am I missing?

You'll need to drop the anchors (^ and $) and enable dotall mode, which allows . to match newlines. In C# that is the s modifier (RegexOptions.Singleline), not m; the m modifier (RegexOptions.Multiline) only changes where ^ and $ match. Check the docs.
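As a minimal sketch of that advice in C#: the pattern below has no anchors and uses RegexOptions.Singleline, which is .NET's name for dotall. The class name, method name, and sample string are made up for illustration, and the labels ("Loan Number:", "Address:") and CRLF line endings are assumptions taken from the question's sample data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LoanNumberDemo
{
    // Hypothetical sample shaped like the question's data (CRLF is an assumption).
    public const string Sample =
        "Loan Number:\r\n\r\n    123456789\r\n\r\n\r\nAddress:\r\n\r\n    101 Main Street\r\n";

    // Returns the trimmed text between the two labels, or null when there is no match.
    public static string ExtractLoanNumber(string body)
    {
        // No ^ or $ anchors; Singleline lets '.' match \n as well.
        Match m = Regex.Match(body, @"Loan Number:(.*?)Address:", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```

Calling LoanNumberDemo.ExtractLoanNumber(LoanNumberDemo.Sample) yields the digits with the surrounding whitespace trimmed away.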

Why is this so difficult?

Regular Expressions are a very powerful tool. With great power comes great responsibility. That is, no one said it would be easy...

UPDATE

After reviewing the question more closely, you have solid anchor points and a very specific capture (i.e. loan number digits). The following regular expression should work, and without the modifier mentioned above.

Loan Number\s+(\d+)\s+Escalation Required

Upvotes: 2

Alan Moore

Reputation: 75222

Your main problem is that you aren't specifying Multiline mode. Without that, ^ only matches the very beginning of the text and $ only matches the very end. Also, the (.*?) needs to match the line separators before and after the loan number in addition to the number itself, and it can't do that unless you specify Singleline mode.
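The effect of each mode can be checked in isolation. This is only a sketch: the class and method names are hypothetical, and the text uses \n line endings and the labels from the question's sample data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Hypothetical four-line text; "Address:" starts the third line.
    public const string Text = "Loan Number:\n123456789\nAddress:\n101 Main Street";

    // Without Multiline, ^ matches only at the very start of the text.
    public static bool CaretWithoutMultiline()
    {
        return Regex.IsMatch(Text, @"^Address:");
    }

    // With Multiline, ^ also matches at the start of each line.
    public static bool CaretWithMultiline()
    {
        return Regex.IsMatch(Text, @"^Address:", RegexOptions.Multiline);
    }

    // Without Singleline, '.' stops at \n, so .*? cannot span lines.
    public static bool DotWithoutSingleline()
    {
        return Regex.IsMatch(Text, @"Loan Number:.*?Address:");
    }

    // With Singleline, '.' matches \n too.
    public static bool DotWithSingleline()
    {
        return Regex.IsMatch(Text, @"Loan Number:.*?Address:", RegexOptions.Singleline);
    }
}
```

Of the four checks, only the two that set the relevant option return true.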

There are two ways you can specify these matching modes. One is by passing the appropriate RegexOptions argument when you create the Regex:

Regex r = new Regex(@"^Loan Number(.*?)Escalation Required.$",
                    RegexOptions.Multiline | RegexOptions.Singleline);

The other is by adding "inline" modifiers to the regex itself:

Regex r = new Regex(@"(?ms)^Loan Number(.*?)Escalation Required.$");

But I recommend you do this instead:

Regex r = new Regex(@"(?m)^Loan Number\s*(\d+)\s*Escalation Required(?=\z|\r\n|[\r\n])");

About \s*(\d+)\s*:
In Singleline mode (known as DOTALL mode in some flavors), there's nothing to stop .*? from matching all the way to the end of the document, however long it happens to be. It will try to consume as little as possible thanks to the non-greedy modifier (?), but in cases where no match is possible, the regex engine will have to do a lot of pointless work before it admits defeat. I practically never use Singleline mode for that reason.

Singleline mode or not, don't use .* or .*? without at least considering something more specific. In this case, \s*(\d+)\s* has the advantage that it allows you to capture the loan number only. You don't have to trim whitespace or perform any other operations to extract the part that interests you.
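A sketch of the more specific pattern applied to the question's sample data. Note the assumptions: it uses the labels "Loan Number:" and "Address:" from the sample rather than "Escalation Required", and the class and method names are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpecificCaptureDemo
{
    // Returns only the digits of the loan number, or null when there is no match.
    public static string ExtractLoanNumber(string body)
    {
        // \s* absorbs the line breaks and indentation, so group 1 holds digits only
        // and no Singleline mode or trimming is needed.
        Match m = Regex.Match(
            body,
            @"^Loan Number:\s*(\d+)\s*Address:",
            RegexOptions.Multiline);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Because \s and \d cannot overlap, a text with no digits after the label fails quickly instead of scanning to the end of the document.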

About (?=\z|\r\n|[\r\n]):
According to the Unicode standard, $ in multiline mode should match before a carriage-return (\r) or before a linefeed (\n) if it's not preceded by \r--it should never match between \r and \n. There are several other single-character line separators as well, but the .NET regex flavor doesn't recognize anything but \n. Your source text (an email message) uses \r\n to separate lines, which is why you had to add that dot before the anchor: .$.

But what if you don't know which kind of line separators to expect? Realistically, \n or \r\n are by far the most common choices, but even if you disregard the others, .$ is going to fail half the time. (?=\z|\r\n|[\r\n]) is still a hack, but it's a much more portable hack. ;) It even handles \r (carriage-return only), the line separator associated with pre-OSX Macintosh systems.
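The difference between $ and the lookahead can be seen with a label that sits alone on its line. The sketch below also shows one possible way to use \z for the case raised in the question's edit (the last field, which runs to the end of the text). It uses Singleline there purely for brevity, and all names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndDemo
{
    // The lookahead from the answer: end of text, CRLF, or a lone CR or LF.
    private const string PortableEol = @"(?=\z|\r\n|[\r\n])";

    // In .NET, $ in Multiline mode matches only before \n or at the end of the text.
    public static bool DollarMatches(string text)
    {
        return Regex.IsMatch(text, @"(?m)^Complaint Date:$");
    }

    // The lookahead accepts \r\n, \n, \r, or end of text after the label.
    public static bool LookaheadMatches(string text)
    {
        return Regex.IsMatch(text, @"(?m)^Complaint Date:" + PortableEol);
    }

    // Captures everything after the last label up to the end of the text,
    // leaving leading and trailing whitespace outside the group.
    public static string ExtractComplaintDate(string body)
    {
        Match m = Regex.Match(
            body,
            @"Complaint Date:\s*(.*?)\s*\z",
            RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

With CRLF text, DollarMatches fails because the character after the label is \r, while LookaheadMatches succeeds for all three separator styles.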

Upvotes: 0

Brazol

Reputation: 455

This one works for me:

Loan Number(?<Number>(.*\n)+)Escalation Required

Where Number named group is the result.
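For reference, a sketch of reading that named group in C#. It assumes the labels from the question's sample data ("Loan Number:" and "Address:") instead of "Escalation Required", and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Returns the trimmed contents of the Number group, or null when there is no match.
    public static string ExtractLoanNumber(string body)
    {
        // (.*\n)+ consumes whole lines; the engine backtracks one line at a time
        // until "Address:" follows. No Singleline mode is needed because \n is
        // matched explicitly.
        Match m = Regex.Match(body, @"Loan Number:(?<Number>(.*\n)+)Address:");
        return m.Success ? m.Groups["Number"].Value.Trim() : null;
    }
}
```

The group still contains the blank lines and indentation around the value, hence the Trim() call.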

Upvotes: 0
