Shiplu Mokaddim

Reputation: 57670

Lookahead assertion as condition in conditional subpattern on .NET regular expression

I have this huge regex for matching credit card numbers. It is written for PCRE and works flawlessly in PHP.

/(\d{13,16})(?(?=<)<|["']).*?(?=(?(?=>)>|["\'])\d{3,4}(?(?=<)<|["']))(?(?=>)>|["'])(\d{3,4})(?(?=<)<|["'])/is
// /i = ignore case
// /s = treat the subject as a single line

I converted it to .NET by adding @ at the beginning and doubling the double quotes, which I think is the proper procedure.

@"(\d{13,16})(?(?=<)<|[""]).*?(?=(?(?=>)>|[""])\d{3,4}(?(?=<)<|[""]))(?(?=>)>|[""])(\d{3,4})(?(?=<)<|[""])"

Now it doesn't match. I know the PCRE and .NET implementations might not be the same, but I thought I could convert the pattern to a compatible form. I looked it up in the MSDN reference, and my pattern seems to contain nothing that is PCRE-specific.

After analyzing the pattern, I found that (?(?=<)<|[""]) is not matching. So I made the regular expression simpler: it is now @"(?(?=q)qu|\w)\w+", and I am matching it against "Queen, Quick, Qi etc".

PHP

Code

$data =  "Queen, Quick, Qi etc";
$pattern = "(?(?=q)qu|\w)\w+";
preg_match_all("/$pattern/is", $data, $matches);
print_r($matches);

Output

Array
(
    [0] => Array
        (
            [0] => Queen
            [1] => Quick
            [2] => etc
        )
)

C# .NET

Code

        string data = "Queen, Quick, Qi etc";
        string pattern = @"(?(?=q)qu|\w)\w+";
        Regex re = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match m in re.Matches(data))
        {
            if (m.Success)
            {
                //Console.WriteLine("Credit Card Number={0}, CCV={1}", m.Groups[1].Value, m.Groups[6].Value);
                for (int i = 1; i < m.Groups.Count; i++)
                {
                    Console.WriteLine("[{0}][{1}]", i, m.Groups[i].Value);
                    for (int j = 0; j < m.Groups[i].Captures.Count; j++)
                    {
                        Console.WriteLine("[{0}][{1}][{2}]", i, m.Groups[i].Value, m.Groups[i].Captures[j].Value);
                    }
                }
            }
        }

Output

Press any key to continue . . .

Nothing is printed.

My questions are

  1. Does a look-ahead assertion work as the condition of a conditional sub-pattern in .NET regular expressions?
  2. How can I modify the simpler regular expression @"(?(?=q)qu|\w)\w+" so that it matches in .NET just as it does in PHP?
  3. For the first regex (the huge one), is there anything I can change so that it matches in .NET just as it does in PHP?

Thanks

Upvotes: 1

Views: 491

Answers (1)

Tim Pietzcker

Reputation: 336418

1.: Conditionals work in .NET just as they do in PHP.

2.: The "simpler" regex is correct for .NET. You're just using it wrong:

You have no capturing groups in your regex. That means that the loop

for (int i = 1; i < m.Groups.Count; i++) {...}

is never executed because m.Groups.Count is 1: only Groups[0], the entire match, exists, and the loop starts at 1.

The correct way would be something like

foreach (Match m in re.Matches(data))
{
   if (m.Success)
   {
       for (int i = 0; i < m.Groups.Count; i++) // Groups are numbered from zero
       {
           // Groups[0] is the entire match
           Console.WriteLine("[{0}][{1}]", i, m.Groups[i].Value);
       }
   }
} 
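To make point 2 concrete, here is a minimal, self-contained sketch of the question's test case with the loop fixed. The class name ConditionalDemo and the helper FindAll are made up for illustration; the pattern, options, and input are the ones from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConditionalDemo
{
    // Hypothetical helper: returns the full text of every match of the simplified pattern.
    public static List<string> FindAll(string data)
    {
        var re = new Regex(@"(?(?=q)qu|\w)\w+",
                           RegexOptions.IgnoreCase | RegexOptions.Singleline);
        var results = new List<string>();
        foreach (Match m in re.Matches(data))
        {
            // The pattern has no capturing groups, so m.Groups.Count is 1
            // and Groups[0] (the entire match) is the only entry.
            results.Add(m.Groups[0].Value);
        }
        return results;
    }

    public static void Main()
    {
        foreach (string s in FindAll("Queen, Quick, Qi etc"))
        {
            Console.WriteLine(s);
        }
    }
}
```

This should print Queen, Quick and etc, the same three matches as the PHP run. "Qi" is skipped in both engines because the lookahead succeeds on "Q" and the "qu" branch then fails.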

3.: Your .NET regex is missing the single quotes: when doubling the double quotes, ["'] became [""] instead of [""'].

Regex regexObj = new Regex(@"(\d{13,16})(?(?=<)<|[""']).*?(?=(?(?=>)>|[""'])\d{3,4}(?(?=<)<|[""']))(?(?=>)>|[""'])(\d{3,4})(?(?=<)<|[""'])", RegexOptions.Singleline);

would be a literal translation.

4.: You don't need the /i modifier or the RegexOptions.IgnoreCase option, as there are no letters in your regex.

5.: (?(?=<)<|["']) makes no sense. It matches exactly the same text as [<"']. After all, it means "if there is a <, then match a <; otherwise, try to match a " or a '". There is no need to use a conditional regex at all.
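The equivalence claimed in point 5 can be spot-checked with a small sketch like the following. The class EquivalenceCheck, the helper SameMatches, and the sample strings are made up for illustration; a handful of samples is a sanity check, not a proof.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EquivalenceCheck
{
    // Hypothetical helper: true when both patterns produce the same matches
    // (same positions, same text) on the given input.
    public static bool SameMatches(string patternA, string patternB, string input)
    {
        MatchCollection a = Regex.Matches(input, patternA);
        MatchCollection b = Regex.Matches(input, patternB);
        if (a.Count != b.Count) return false;
        for (int i = 0; i < a.Count; i++)
        {
            if (a[i].Index != b[i].Index || a[i].Value != b[i].Value) return false;
        }
        return true;
    }

    public static void Main()
    {
        string conditional = @"(?(?=<)<|[""'])"; // the conditional from the question
        string charClass = @"[<""']";            // the plain character class
        // Assumed sample inputs, chosen to cover each delimiter and the no-match case.
        string[] samples = { "<b>", "say \"hi\"", "it's", "no delimiters", "<'\"<" };
        foreach (string s in samples)
        {
            Console.WriteLine("{0} -> {1}", s, SameMatches(conditional, charClass, s));
        }
    }
}
```

Each sample should report True, since the conditional only ever consumes a <, a " or a '.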

So the entire regex can be simplified to

(\d{13,16})[<"'].*?(?=[>"']\d{3,4}[<"'])[>"'](\d{3,4})[<"']

6.: This shows another superfluous part of the regex more clearly: You have a lookahead assertion (?=[>"']\d{3,4}[<"']) that is followed by the exact same regex [>"'](\d{3,4})[<"'], so the lookahead can be dropped entirely.

End result:

(\d{13,16})[<"'].*?[>"'](\d{3,4})[<"']

or, in C#:

Regex regexObj = new Regex(@"(\d{13,16})[<""'].*?[>""'](\d{3,4})[<""']", RegexOptions.Singleline);
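As a usage sketch, the final regex can be run next to the intermediate version that still contains the lookahead, to see that both capture the same values. The class CardExtractor, the helper Extract, and the sample input (a well-known test card number in made-up markup) are assumptions for illustration; the real input format is not shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CardExtractor
{
    // Final simplified regex from the answer.
    public static readonly Regex Simplified = new Regex(
        @"(\d{13,16})[<""'].*?[>""'](\d{3,4})[<""']",
        RegexOptions.Singleline);

    // Intermediate form that still has the lookahead, kept only for comparison.
    public static readonly Regex WithLookahead = new Regex(
        @"(\d{13,16})[<""'].*?(?=[>""']\d{3,4}[<""'])[>""'](\d{3,4})[<""']",
        RegexOptions.Singleline);

    // Hypothetical helper: returns { number, code } from the first match, or null.
    public static string[] Extract(Regex re, string input)
    {
        Match m = re.Match(input);
        return m.Success ? new[] { m.Groups[1].Value, m.Groups[2].Value } : null;
    }

    public static void Main()
    {
        // Assumed sample input; Singleline lets .*? cross the line break.
        string sample = "<number>4111111111111111</number>\n<cvv>123</cvv>";
        foreach (Regex re in new[] { Simplified, WithLookahead })
        {
            string[] r = Extract(re, sample);
            Console.WriteLine(r == null ? "no match" : r[0] + " / " + r[1]);
        }
    }
}
```

Both patterns should print the same number and code for this sample, which is consistent with the lookahead being redundant.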

Upvotes: 2
