flipout

Reputation: 23

Trouble with Regex matching in Perl

I'm having trouble matching a regex in Perl and was wondering if anyone had any insight:

Here is my regex: /^-MEMBER:\s+(\b[^,]+)(?:,\s(\b.{1,50}\b)\.?)?\s+ID#:\s+(\d+)$/

Here is what I'm matching:

-MEMBER: Doe, John H ID#: 3907

The regex works beautifully and matches the above line, but I am having trouble with lines that do not contain a first and middle name. Example below:

-MEMBER: Doe, ID#: 3907

I can't get the current regex to match both lines.

Thanks for any help!

Upvotes: 2

Views: 99

Answers (3)

Spencer Rathbun

Reputation: 14900

The trouble is that what you really want is a grammar describing your input. Attempting to describe it all in one go gets very complex, very fast. See the Perl yapp module (Parse::Yapp) for an alternative.

However, if you insist on just using the regex, here we go:

/^-MEMBER: # start of line, match specific string
\s+ # must be followed by at least one whitespace char
(\b[^,]+) # capture everything up to the first comma
(?:,\s(\b.{1,50}\b)\.?)? # here's the pain, so let's deal with it below
\s+ # more whitespace
ID\#: # match this string (the # is escaped because the x flag treats a bare # as a comment start)
\s+ # and some more whitespace
(\d+)$/x # digits at the end of the line

(?: # cluster the following without capturing
 ,\s # comma, then a single whitespace character
 (
  \b.{1,50}\b # one to fifty characters, starting and ending on a word boundary
 ) # another capture group
 \.? # optional period
)? # zero or one of these, i.e. the whole cluster is optional

This is fragile, because it hard-codes assumptions into your "language". Note that if we don't have a first/middle name, we are not allowed a comma either, since the comma is inside the optional group. That is why your second test line does not match.
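To see the failure for yourself, here is a minimal sketch that runs the regex from the question, unchanged, against both sample lines. The variable names ($original, @lines, @results) are mine, not from the question.

```perl
use strict;
use warnings;

# The regex from the question, unchanged (no x flag, so the # is literal).
my $original = qr/^-MEMBER:\s+(\b[^,]+)(?:,\s(\b.{1,50}\b)\.?)?\s+ID#:\s+(\d+)$/;

my @lines = (
    '-MEMBER: Doe, John H ID#: 3907',   # comma followed by a name
    '-MEMBER: Doe, ID#: 3907',          # comma with nothing after it
);

# Record 1 for a match and 0 for a failure, in input order.
my @results;
for my $line (@lines) {
    my $matched = $line =~ $original ? 1 : 0;
    push @results, $matched;
    printf "%-32s %s\n", $line, $matched ? 'matches' : 'does not match';
}
```

The second line fails because the comma is only allowed inside the optional group, and that group also demands at least one character of name after the comma.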

Secondly, if we have a first/middle name section, it can include anything except a newline. This may not be what you want or expect.

The reason that parsers are useful is not necessarily that they allow you to have "context", though they do that. It is that they break your complex regex into small, manageable pieces connected together into a clearly defined whole. Once you have learned such a tool, the type of problem you have here becomes trivial to implement, and to change.
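Short of a full parser, you can get some of the same benefit by building the pattern from small named pieces with qr//. This is only a sketch: the piece names ($surname, $given, $id), the parse_member helper, and above all the character classes are my assumptions about what is valid in each field, not rules taken from the question.

```perl
use strict;
use warnings;

# Hypothetical building blocks. Adjust the classes to your real data.
my $surname = qr/[A-Za-z][A-Za-z' -]*/;   # starts with a letter
my $given   = qr/[A-Za-z][A-Za-z. ]*/;    # first and middle names, initials
my $id      = qr/\d+/;

# The whole line, assembled from the pieces above.
my $member = qr/
    ^-MEMBER: \s+
    ($surname) ,            # surname, then a mandatory comma
    (?: \s+ ($given) )?     # optional given names
    \s+ ID\#: \s+           # the ID label, with the hash sign escaped under the x flag
    ($id) $
/x;

# Returns a hash reference on success, nothing on failure.
sub parse_member {
    my ($line) = @_;
    return unless $line =~ $member;
    return { surname => $1, given => $2, id => $3 };
}

my $rec = parse_member('-MEMBER: Doe, John H ID#: 3907');
print "$rec->{surname} / $rec->{given} / $rec->{id}\n" if $rec;
```

Each piece can now be tightened or loosened on its own without re-reading the whole expression.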

Notice how your regex is attempting to define what is "valid" in each section. The last name (\b[^,]+) can contain anything besides a comma! Is this what you want? What happens if valid names can only have [a-zA-Z_] in them? Is ;injectionattemptFTW!!;# a valid name? Design your program so that there is a limited and obvious set of conditions. "If a then valid, else fail" is easy to reason about when each a is simple.
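As a rough illustration of the "if a then valid, else fail" idea, here is a sketch of an allowlist check. It assumes, purely for the sake of the example above, that a valid name contains only [a-zA-Z_]; the is_valid_name helper is hypothetical, and real names will likely need a wider set.

```perl
use strict;
use warnings;

# Assumption from the example in the text: only [a-zA-Z_] is allowed.
# \A and \z anchor to the very start and end of the string.
sub is_valid_name {
    my ($name) = @_;
    return $name =~ /\A[a-zA-Z_]+\z/ ? 1 : 0;
}

for my $candidate ('Doe', ';injectionattemptFTW!!;#') {
    printf "%-28s %s\n", $candidate, is_valid_name($candidate) ? 'valid' : 'invalid';
}
```

Stating what is allowed, rather than what is forbidden, keeps the set of accepted inputs small and easy to reason about.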

Unless you define all possible special cases, you will run into things that make this regex break. I can't define a perfect regex, so you have two options:

  1. Patch regex into even more complexity as special cases are identified
  2. Redesign to avoid need for complex regex

If you choose option one, then this regex fixes your current problem:

/^-MEMBER:\s+(\b[^,]+),?(?:\s(\b.{1,50}\b)\.?)?\s+ID#:\s+(\d+)$/
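A quick way to check this against both of your sample lines is sketched below. The parse_fixed helper and the variable names are mine; the regex is the one above, unchanged.

```perl
use strict;
use warnings;

# The patched regex: the comma is now optional and sits outside the group.
my $fixed = qr/^-MEMBER:\s+(\b[^,]+),?(?:\s(\b.{1,50}\b)\.?)?\s+ID#:\s+(\d+)$/;

# Returns (last, first, id) on success, an empty list on failure.
# "first" is undef when the line has no first or middle name.
sub parse_fixed {
    my ($line) = @_;
    return $line =~ $fixed;
}

for my $line ('-MEMBER: Doe, John H ID#: 3907', '-MEMBER: Doe, ID#: 3907') {
    if (my ($last, $first, $id) = parse_fixed($line)) {
        $first = '(none)' unless defined $first;
        print "last=$last first=$first id=$id\n";
    }
    else {
        print "no match: $line\n";
    }
}
```

Because the comma is optional here, a line with no comma at all, such as "-MEMBER: Doe ID#: 3907", is accepted as well. Whether that is desirable depends on your data.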

Upvotes: 0

foundry

Reputation: 31745

You have placed your comma match inside your optional firstname group, so you can only match a comma in the presence of a firstname. If commas will accompany surnames without firstnames, you need to move it to the surname group.

/^-MEMBER:\s+(\b[^,]+,)(?:\s(\b.{1,50}\b)\.?)?\s+ID#:\s+(\d+)$/
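One thing to be aware of: with the comma inside the surname parentheses, the captured surname includes the trailing comma. The sketch below compares the regex above with a variant that keeps the comma mandatory but outside the capture. The variant and the names $comma_inside, $comma_outside and surname_with are mine, offered as an assumption about what you probably want in $1.

```perl
use strict;
use warnings;

# As above: the comma is part of capture group 1.
my $comma_inside  = qr/^-MEMBER:\s+(\b[^,]+,)(?:\s(\b.{1,50}\b)\.?)?\s+ID#:\s+(\d+)$/;

# Variant: the comma is still required, but is left out of the capture.
my $comma_outside = qr/^-MEMBER:\s+(\b[^,]+),(?:\s(\b.{1,50}\b)\.?)?\s+ID#:\s+(\d+)$/;

# Returns the first capture (the surname), or undef if the line does not match.
sub surname_with {
    my ($re, $line) = @_;
    my ($surname) = $line =~ $re;
    return $surname;
}

my $line = '-MEMBER: Doe, ID#: 3907';
print "inside:  [", surname_with($comma_inside,  $line), "]\n";
print "outside: [", surname_with($comma_outside, $line), "]\n";
```

Both forms match both sample lines; they differ only in whether you need to strip the comma afterwards.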

Upvotes: 1

Gilles Quénot

Reputation: 184965

This regex will match both lines:

/
    ^-MEMBER:\s+         # the beginning of the line with "-MEMBER: "
    .*?                  # non greedy
    \s+ID\#:\s+(\d+)$    # space and end ID part (the # is escaped, otherwise the x flag treats it as a comment start)
/x
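A minimal usage sketch for pulling out just the ID with this looser pattern. The member_id helper and the $loose variable are hypothetical names of mine. Note that the # in ID#: is written as \# here, because under the x flag an unescaped # starts a comment and the rest of that line, capture group included, would be ignored.

```perl
use strict;
use warnings;

my $loose = qr/
    ^-MEMBER:\s+         # the beginning of the line
    .*?                  # anything at all, non greedy
    \s+ID\#:\s+(\d+)$    # whitespace, the ID label, then the digits
/x;

# Returns the numeric ID, or undef if the line does not match.
sub member_id {
    my ($line) = @_;
    my ($id) = $line =~ $loose;
    return $id;
}

for my $line ('-MEMBER: Doe, John H ID#: 3907', '-MEMBER: Doe, ID#: 3907') {
    my $id = member_id($line);
    print defined $id ? "id=$id\n" : "no match\n";
}
```

This version only captures the ID; the name part is matched but thrown away, so it suits cases where the name does not need to be validated or extracted.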

Upvotes: 0
