dingalla

Reputation: 1249

Regex (C#) - how to match variable names that start with a colon

I need to distinguish variable names from non-variable names in some expressions I am trying to parse. Variable names start with a colon, may contain digits (but not as the first character after the colon), and may contain underscores. So valid variable names are:

:x :_x :x2 :alpha_x   // etc

Then I have to pick out other words in the expression that don't begin with colons. So in the following expression:

:result = median(:x,:y,:z)

The variables would be :result, :x, :y, and :z while the other non-variable word would be median.

My regex to pick out the variable names is (this works):

:[a-zA-Z_]{1}[a-zA-Z0-9_]*

But I cannot figure out how to get the non-variable words. My regex for that is:

(?<!:)([a-zA-Z_]{1}[a-zA-Z0-9_]*)

The issue is that the match only excludes the first character after the colon. For :result, for example, it still matches esult.

Upvotes: 5

Views: 447

Answers (2)

Wiktor Stribiżew

Reputation: 627103

The (?<!:)([a-zA-Z_]{1}[a-zA-Z0-9_]*) regex still matches parts of variable names because (?<!:) only ensures there is no : immediately to the left of the current location; the pattern then matches an identifier without checking for a word boundary. So, in :alpha, the a is rejected, but lpha is matched because l is preceded by a char other than :.
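To make the partial match concrete, here is a minimal sketch that runs the asker's pattern over the expression from the question. It assumes C# 9 or later (top-level statements); the variable names input and partial are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The expression from the question.
string input = ":result = median(:x,:y,:z)";

// The asker's pattern: the lookbehind only rejects the position directly after ':',
// so the regex engine retries one character later and succeeds there.
var partial = Regex.Matches(input, @"(?<!:)([a-zA-Z_]{1}[a-zA-Z0-9_]*)")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

Console.WriteLine(string.Join(", ", partial)); // esult, median
```

The single-letter variables :x, :y and :z produce no match at all, since there is nothing left to match once the first letter is rejected.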

Hence the problem is easy to solve by adding a word boundary before [a-zA-Z_]:

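// Requires: using System.Linq; using System.Text.RegularExpressions;
// s is the expression string to scan.
var words = Regex.Matches(s, @"(?<!:)\b[a-zA-Z_]\w*", RegexOptions.ECMAScript)
        .Cast<Match>()
        .Select(x => x.Value)
        .ToList();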

Note that you do not need to wrap the whole pattern in a capturing group.

Pattern details

  • (?<!:) - make sure there is no : immediately to the left of the current location
  • \b - a word boundary: make sure there are no letters, digits or _ immediately to the left of the current location
  • [a-zA-Z_] - match an ASCII letter or _
  • \w* - 0+ ASCII letters, digits or _ (must be used with the ECMAScript option to only match ASCII letters and digits and make word boundary handle ASCII only)
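As a quick check of the pattern described above, the following sketch extracts both the variables (using the pattern from the question) and the remaining words from the question's sample expression. It assumes C# 9 or later for top-level statements; expression, variables and words are illustrative names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string expression = ":result = median(:x,:y,:z)";

// Variables: a colon followed by an identifier (the pattern from the question).
var variables = Regex.Matches(expression, @":[a-zA-Z_]{1}[a-zA-Z0-9_]*")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

// Other words: an identifier that starts at a word boundary and is not preceded by ':'.
var words = Regex.Matches(expression, @"(?<!:)\b[a-zA-Z_]\w*", RegexOptions.ECMAScript)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

Console.WriteLine(string.Join(" ", variables)); // :result :x :y :z
Console.WriteLine(string.Join(" ", words));     // median
```

Inside :result, the position before e passes the lookbehind but fails \b, because both r and e are word characters, so no partial match is produced.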

Upvotes: 1

Tim Biegeleisen

Reputation: 522084

The following pattern seems to work:

(?<=[^A-Za-z0-9_:])[a-zA-Z_]{1}[a-zA-Z0-9_]*

The lookbehind (?<=[^A-Za-z0-9_:]) asserts that the preceding character is neither a character allowed in a variable name nor a colon. That position then marks the start of a non-variable word.
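One thing to be aware of: a positive lookbehind needs a character to actually exist before the word, so a non-variable word at the very start of the input is not matched. The sketch below illustrates this with a hypothetical input that begins with median, and compares it with a negative-lookbehind variation (not part of the answer above). It assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input in which a non-variable word is the very first token.
string input = "median(:x,:y,:z)";

// Positive lookbehind from the answer: some character must precede the word.
var positive = Regex.Matches(input, @"(?<=[^A-Za-z0-9_:])[a-zA-Z_]{1}[a-zA-Z0-9_]*")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

// Negative lookbehind (a variation): also succeeds when nothing precedes the word.
var negative = Regex.Matches(input, @"(?<![A-Za-z0-9_:])[a-zA-Z_]{1}[a-zA-Z0-9_]*")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

Console.WriteLine(positive.Count);             // 0
Console.WriteLine(string.Join(" ", negative)); // median
```

For the expression in the question this difference does not show up, because median is preceded by a space there.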


Upvotes: 1
