Jonathan

Reputation: 599

Regular Expressions: Determining if a String is either a number or variable

I am trying to combine two Regular Expression patterns to determine if a String is either a double value or a variable. My restrictions are as follows:

The variable can only begin with an _ or alphabetical letter (A-Z, ignoring case), but it can be followed by zero or more _s, letters, or digits.

Here's what I have so far, but I can't get it to work properly.

String varPattern = @"[a-zA-Z_](?: [a-zA-Z_]|\d)*";
String doublePattern = @"(?: \d+\.\d* | \d*\.\d+ | \d+ ) (?: [eE][\+-]?\d+)?";

String pattern = String.Format("({0}) | ({1})",
                             varPattern, doublePattern);
Regex.IsMatch(word, varPattern, RegexOptions.IgnoreCase)

It seems that it is capturing both Regular Expression patterns, but I need it to be either/or.

For example, _A2 2 is valid using the code above, but _A2 is invalid.

Some examples of valid variables are as follows:

_X6 , _ , A , Z_2_A

And some examples of invalid variables are as follows:

2_X6 , $2 , T_2$

I guess I just need clarification on the pattern format for the Regular Expression. The format is unclear to me.

Upvotes: 0

Views: 193

Answers (3)

Nicholas Carey

Reputation: 74267

As noted, the literal whitespace you've put in your regular expressions is part of the regular expression. You won't get a match unless that same whitespace is in the text being scanned by the regular expression. If you want to use whitespace to make your regex more readable, you'll need to specify RegexOptions.IgnorePatternWhitespace. After that, if you want to match whitespace in the input, you'll have to do so explicitly, for example with \s or \x20.

It should be noted that if you do specify RegexOptions.IgnorePatternWhitespace, you can use Perl-style comments (# to end of line) to document your regular expression (as I've done below). For complex regular expressions, someone 5 years from now — who might be you! — will thank you for the kindness.
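As a minimal sketch of the difference the option makes (the ^ and $ anchors and the variable names are my additions for the demo, and it uses C# top-level statements), the same commented pattern can be run with and without the flag:

```csharp
using System;
using System.Text.RegularExpressions;

// Identifier pattern laid out over several lines with #-comments.
// The ^ and $ anchors are an addition for this demo.
string commented = @"
    ^
    [a-zA-Z_]        # first character: a letter or an underscore
    [a-zA-Z0-9_]*    # then zero or more letters, digits, or underscores
    $";

// Without the option, every space, newline, and comment character is literal pattern text.
bool withoutOption = Regex.IsMatch("_X6", commented);

// With the option, unescaped whitespace and #-comments are skipped.
bool withOption = Regex.IsMatch("_X6", commented, RegexOptions.IgnorePatternWhitespace);

// A space that must really appear in the input is written explicitly, here as \x20.
bool explicitSpace = Regex.IsMatch("12 34", @"^ \d+ \x20 \d+ $", RegexOptions.IgnorePatternWhitespace);

Console.WriteLine(withoutOption);  // False
Console.WriteLine(withOption);     // True
Console.WriteLine(explicitSpace);  // True
```

Note that whitespace inside a character class such as [ ] is still literal even with the option set, so it is safest to keep spaces out of the brackets unless you mean them.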

Your [presumably intended] patterns are also, I think, more complex than they need be. A regular expression to match the identifier rule you've specified is this:

[a-zA-Z_][a-zA-Z0-9_]*

Broken out into its constituent parts:

[a-zA-Z_]     # match an upper- or lower-case letter or an underscore, followed by
[a-zA-Z0-9_]* # zero or more occurrences of an upper- or lower-case letter, decimal digit or underscore

A regular expression to match the conventional style of a numeric/floating-point literal is this:

([+-]?[0-9]+)(\.[0-9]+)?([Ee][+-]?[0-9]+)?

Broken out into its constituent parts:

(        # a mandatory group that is the integer portion of the value, consisting of
  [+-]?  # - an optional plus- or minus-sign, followed by
  [0-9]+ # - one or more decimal digits
)        # followed by
(        # an optional group that is the fractional portion of the value, consisting of
  \.     # - a decimal point, followed by
  [0-9]+ # - one or more decimal digits
)?       # followed by,
(        # an optional group, that is the exponent portion of the value, consisting of
  [Ee]   # - The upper- or lower-case letter 'E' indicating the start of the exponent, followed by
  [+-]?  # - an optional plus- or minus-sign, followed by
  [0-9]+ # - one or more decimal digits.
)?       # Easy!

Note: Some grammars differ as to whether the sign of the value is a unary operator or part of the value and whether or not a leading + sign is allowed. Grammars also vary as to whether something like 123245. is valid (that is, whether a decimal point with no fractional digits is allowed).
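To see what each of the three groups picks up, here is a small sketch using that numeric pattern. The ^ and $ anchors and the sample inputs are my additions, so treat it as an illustration rather than a full number grammar:

```csharp
using System;
using System.Text.RegularExpressions;

// The numeric pattern, anchored (an addition) so the whole input must be a number.
var number = new Regex(@"^([+-]?[0-9]+)(\.[0-9]+)?([Ee][+-]?[0-9]+)?$");

Match m = number.Match("-12.5e+3");
Console.WriteLine(m.Groups[1].Value); // -12   (integer portion)
Console.WriteLine(m.Groups[2].Value); // .5    (fractional portion)
Console.WriteLine(m.Groups[3].Value); // e+3   (exponent portion)

// An optional group that took no part in the match reports Success == false.
Match plain = number.Match("42");
Console.WriteLine(plain.Groups[2].Success); // False

// This particular grammar rejects a bare trailing or leading decimal point.
Console.WriteLine(number.IsMatch("123245.")); // False
Console.WriteLine(number.IsMatch(".5"));      // False
```

The pattern in the question accepts both 1. and .5, so if you want to keep that behaviour the fractional group would need to be loosened accordingly.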

To combine these two regular expressions:

  • First, group each of them with parentheses (you might want to name the containing groups, as I've done):

    (?<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
    (?<number>[+-]?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?)
    
  • Next, combine with the alternation operation, |:

    (?<identifier>[a-zA-Z_][a-zA-Z0-9_]*)|(?<number>[+-]?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?)
    
  • Finally, enclose the whole shebang in an @"..." literal and you should be good to go.

That's about all there is to it.
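Put together, a rough sketch of the combined check might look like this. The Classify helper, the anchors, the outer (?: ... ) wrapper and the non-capturing inner groups are my own additions, and the number group here wraps the whole literal so that it captures the full value:

```csharp
using System;
using System.Text.RegularExpressions;

// Either an identifier or a number, and nothing else in the input.
var token = new Regex(
    @"^(?:(?<identifier>[a-zA-Z_][a-zA-Z0-9_]*)|(?<number>[+-]?[0-9]+(?:\.[0-9]+)?(?:[Ee][+-]?[0-9]+)?))$");

// Hypothetical helper: reports which named group, if any, matched.
string Classify(string word)
{
    Match m = token.Match(word);
    if (!m.Success) return "neither";
    return m.Groups["identifier"].Success ? "identifier" : "number";
}

// Sample words from the question, plus a few numbers.
foreach (string word in new[] { "_X6", "_", "A", "Z_2_A", "2_X6", "$2", "T_2$", "3.14", "1e10", "_A2 2" })
{
    Console.WriteLine($"{word} -> {Classify(word)}");
}
```

With the question's examples, _X6, _, A and Z_2_A come out as identifiers, 3.14 and 1e10 as numbers, and 2_X6, $2, T_2$ and _A2 2 as neither.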

Upvotes: 2

Jeremy

Reputation: 6670

You should avoid having spaces in your regular expressions unless you explicitly set RegexOptions.IgnorePatternWhitespace. To make sure you only get matches on complete words, you should include the beginning of line (^) and end of line ($) anchors. I would also suggest you build the entire expression pattern instead of using String.Format("({0}) | ({1})", ...) as you have here.

The below should work given your examples:

string pattern = @"^(?:[a-zA-Z_][a-zA-Z_\d]*|\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?)$";

Upvotes: 1

Andrew Clark

Reputation: 208475

Spaces are not ignored in regular expressions by default, so each space in your current expressions has to match a literal space in the input string. Add the RegexOptions.IgnorePatternWhitespace flag or remove the spaces from your expressions.

You will also want to add some beginning and end of string anchors (^ and $ respectively) so you do not match just part of a string.
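A quick sketch of why the anchors matter, using a plain identifier pattern (the pattern string and variable names are mine, chosen to fit the question's examples):

```csharp
using System;
using System.Text.RegularExpressions;

string identifier = @"[a-zA-Z_][a-zA-Z0-9_]*";

// Unanchored: IsMatch succeeds if the pattern matches anywhere in the input.
bool partialTail = Regex.IsMatch("T_2$", identifier);  // True, it finds "T_2"
bool partialHead = Regex.IsMatch("2_X6", identifier);  // True, it finds "_X6"

// Anchored: the whole input has to be an identifier.
bool anchoredBad = Regex.IsMatch("T_2$", "^" + identifier + "$");  // False
bool anchoredGood = Regex.IsMatch("_X6", "^" + identifier + "$");  // True

// In .NET, $ also matches just before a final newline; \z matches only at the very end.
bool dollarNewline = Regex.IsMatch("_X6\n", "^" + identifier + "$");    // True
bool strictNewline = Regex.IsMatch("_X6\n", "^" + identifier + @"\z");  // False

Console.WriteLine($"{partialTail} {partialHead} {anchoredBad} {anchoredGood} {dollarNewline} {strictNewline}");
```

If a trailing newline should count as invalid input, \z is the stricter choice for the end anchor.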

Upvotes: 1
