davis

Reputation: 381

Regex - Ignore the white spaces

I have a regex:

Regex.Match(result, @"\bTop Rate\b.*?\s*\s*([\d,\.]+)", RegexOptions.IgnoreCase);

I then parse the captured group into an int:

topRate = int.Parse(topRateMatch.Groups[1].Value, System.Globalization.NumberStyles.AllowThousands);

Example)

Top Rate: 888,888
Output: 888888

I'm getting the int output just fine with my current regex. However, I noticed that when there is whitespace between the digits, for example,

Top Rate: 8         88,888

I only get an 8. Is there a way to ignore any whitespace that may or may not appear between the digits or after the Top Rate label?

Example)

Top Rate:                       8                      88,888
Expected output: 888888

Top Rate:                       8     88,888
Expected output: 888888

Top Rate: 8                      88,888
Expected output: 888888

Top Rate: 8 8 8,888
Expected output: 888888

Top Rate: 888,          8  88
Expected output: 888888

Upvotes: 3

Views: 581

Answers (3)

Wiktor Stribiżew

Reputation: 627607

First of all, a single capture group cannot skip characters in the middle of what it matches: the captured value always includes any whitespace between the digits (the only alternative would be to extract several matches after a given string). However, there is an easy two-step approach.

You may add \s (any whitespace), or \p{Zs} and \t (horizontal whitespace only), to the character class. I would recommend matching the first digit with \d and then using an optional non-capturing group that ends in \d, so that the captured number both starts and ends with a digit:

\bTop Rate\b.*?(\d(?:[\d,.\s]*\d)?)

See the regex demo. Note that repeating \s*\s* makes little sense: \s* already matches zero or more whitespace chars, and even a single \s* is redundant here because the preceding .*? already matches zero or more chars other than LF, as few as possible. To make . match across line breaks as well, add the RegexOptions.Singleline option.

Details:

  • \bTop Rate\b - a whole word Top Rate
  • .*? - zero or more chars other than a newline char, as few as possible
  • (\d(?:[\d,.\s]*\d)?) - Group 1:
    • \d - a digit
    • (?:[\d,.\s]*\d)? - an optional non-capturing group matching zero or more digits, commas, dots or whitespace chars, followed by a digit.
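
To illustrate the Singleline remark, here is a minimal sketch. The input string and the `HasTopRate` helper are made up for this example. As far as I can tell, the option only matters when a line break sits between `Top Rate` and the first digit; line breaks between the digits are already covered by `\s` in the character class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    const string Pattern = @"\bTop Rate\b.*?(\d(?:[\d,.\s]*\d)?)";

    // Hypothetical helper: true when the pattern finds a Top Rate value.
    public static bool HasTopRate(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, Pattern, options);
    }

    public static void Main()
    {
        // Made-up input with a line break before the first digit.
        string text = "Top Rate:\n888,888";

        // Without Singleline, "." cannot cross the "\n", so nothing matches.
        Console.WriteLine(HasTopRate(text, RegexOptions.None));       // False

        // With Singleline, "." also matches "\n", so the value is found.
        Console.WriteLine(HasTopRate(text, RegexOptions.Singleline)); // True
    }
}
```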

Next, once you have the match, keep only the digits.

using System;
using System.Linq;
using System.Text.RegularExpressions;

var text = "Top Rate: 8                      88,888";
var result = Regex.Match(text, @"\bTop Rate\b.*?(\d(?:[\d,.\s]*\d)?)", RegexOptions.Singleline);
if (result.Success)
{
    // Keep only the digits from Group 1; prints 888888
    Console.WriteLine( new string(result.Groups[1].Value.Where(c => char.IsDigit(c)).ToArray()) );
}

See the C# demo. With multiple matching:

using System;
using System.Linq;
using System.Text.RegularExpressions;

var text = "Top Rate: 8                      88,888 and Top Rate:                       8  \n   88,888";
var results = Regex.Matches(text, @"\bTop Rate\b.*?(\d(?:[\d,.\s]*\d)?)", RegexOptions.Singleline)
        .Cast<Match>()
        .Select(x => new string(x.Groups[1].Value.Where(c => char.IsDigit(c)).ToArray()));
foreach (var s in results)
{
    Console.WriteLine( s );
}

See this C# demo.
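
To feed the result back into the int from the question, the two steps can be wrapped in one method. This is only a minimal sketch: `TopRateParser` and `TryParseTopRate` are hypothetical names, and the inputs are samples taken from the question.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class TopRateParser
{
    static readonly Regex TopRate = new Regex(
        @"\bTop Rate\b.*?(\d(?:[\d,.\s]*\d)?)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: true when a Top Rate value was found and fits in an int.
    public static bool TryParseTopRate(string input, out int topRate)
    {
        topRate = 0;
        Match m = TopRate.Match(input);
        if (!m.Success) return false;

        // Step 2: drop everything that is not a digit (whitespace, commas, dots).
        string digits = new string(m.Groups[1].Value.Where(c => char.IsDigit(c)).ToArray());
        return int.TryParse(digits, NumberStyles.None, CultureInfo.InvariantCulture, out topRate);
    }

    public static void Main()
    {
        string[] samples =
        {
            "Top Rate: 888,888",
            "Top Rate:                       8     88,888",
            "Top Rate: 8 8 8,888",
            "Top Rate: 888,          8  88",
        };
        foreach (string s in samples)
        {
            // Each of these samples should print 888888.
            Console.WriteLine(TryParseTopRate(s, out int value) ? value.ToString() : "no match");
        }
    }
}
```

Note that dots are stripped along with commas and whitespace, so a value such as 12.50 would come out as 1250. That looks fine for the integer rates in the question, but it is an assumption worth checking against real data.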

Upvotes: 2

A. Gopal Reddy

Reputation: 380

I checked, and a small change to the regex achieves your goal.

First one:

enter image description here

Second one:

enter image description here

Upvotes: 0

Nicholas Carey

Reputation: 74385

Something like this?

using System;
using System.Text.RegularExpressions;
                    
public class Program
{
  public static void Main()
  {
    string[] texts = {
      "This should Not match the Top Rate thing",
      " Top Rate    : 888,888 ",
      "Top    Rate   : 8 8 8 , 8 8 8 ",
    };
    Regex rxNonDigit = new Regex(@"\D+"); // matches 1 or more characters other than decimal digits.
    Regex rxTopRate = new Regex(@"
      ^           # match start of line, followed by
      \s*         # zero or more lead-in whitespace characters, followed by
      Top         # the literal 'Top', followed by
      \s+         # 1 or more whitespace characters, followed by
      Rate        # the literal 'Rate', followed by
      \s*         # zero or more whitespace characters, followed by
      :           # a literal colon ':', followed by
      \s*         # zero or more whitespace characters, followed by
      (?<rate>    # a named (explicit) capture group, containing
        \d+       # - 1 or more decimal digits, followed by
        (         # - an unnamed group, containing
          (\s|,)+ #     - interstitial whitespace or a comma, followed by
          \d+     #     - 1 or more decimal digits
        )*        #   the whole of which is repeated zero or more times
      )           # followed by
      \s*         # zero or more lead-out whitespace characters, followed by
      $           # end of line
    ", RegexOptions.IgnorePatternWhitespace|RegexOptions.ExplicitCapture );

    foreach ( string text in texts )
    {
      Match m = rxTopRate.Match(text);
      if (!m.Success)
      {
        Console.WriteLine("No Match: '{0}'", text);
      }
      else
      {
        string rawValue = m.Groups["rate"].Value;
        string cleanedValue = rxNonDigit.Replace(rawValue, "");
        Decimal value = Decimal.Parse(cleanedValue);

        Console.WriteLine(@"Matched: '{0}' >>> '{1}' >>> '{2}' >>> {3}",
          text,
          rawValue,
          cleanedValue,
          value
        );
      }
    }

  }
    
}

Upvotes: 0
