CoderX_599

Reputation: 83

Regex to match on capital letter, digit or capital, lowercase, and digit

I'm working on an application which will calculate molecular weight and I need to separate a string into the different molecules. I've been using a regex to do this but I haven't quite gotten it to work. I need the regex to match on patterns like H2OCl4 and Na2H2O where it would break it up into matches like:

  1. H2
  2. O
  3. Cl4

  1. Na2
  2. H2
  3. O

The regex I've been working on is this:

([A-Z]\d*|[A-Z]*[a-z]\d*)

It's really close but it currently breaks the matches into this:

  1. H2
  2. O
  3. C
  4. l4

I need the Cl4 to be considered one match. Can anyone help me with the last part I'm missing? I'm pretty new to regular expressions. Thanks.

Upvotes: 8

Views: 10632

Answers (2)

Diego

Reputation: 18349

Note that if you expect international characters in your input, such as letters with diacritic marks (ñ, é, è, ê, ë, etc.), then you should use the corresponding Unicode categories instead of [A-Z] and [a-z]. In your case, what you want is @"\p{Lu}\p{Ll}?\d*".
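To make the difference concrete, here is a small sketch in C# that compares the ASCII ranges with the Unicode categories. The class and method names are made up for illustration, and the patterns are anchored with ^ and $ only so that each call tests a whole token.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, only for comparing the two patterns.
static class UnicodeTokenDemo
{
    // ASCII-only: [A-Z] and [a-z] do not cover letters such as Ñ or é.
    public static bool AsciiMatches(string token) =>
        Regex.IsMatch(token, @"^[A-Z][a-z]?\d*$");

    // Unicode categories: \p{Lu} is any uppercase letter, \p{Ll} any lowercase letter.
    public static bool UnicodeMatches(string token) =>
        Regex.IsMatch(token, @"^\p{Lu}\p{Ll}?\d*$");
}
```

With a token such as "\u00D1e2" (Ñ followed by e and 2), AsciiMatches returns false and UnicodeMatches returns true, while a plain token such as "Cl4" passes both.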

Upvotes: 1

Jim Mischel

Reputation: 133975

I think what you want is "[A-Z][a-z]?\d*"

That is, a capital letter, followed by an optional small letter, followed by an optional string of digits.
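As a rough sketch of how you might use that pattern from C#, the following collects every match into a list. FormulaSplitter and Split are names I made up for the example; only Regex, Regex.Matches, and Match.Value come from the framework.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name is for illustration only.
static class FormulaSplitter
{
    // One capital letter, an optional lowercase letter, then optional digits.
    private static readonly Regex Token = new Regex(@"[A-Z][a-z]?\d*");

    // Returns each element-plus-count token in the order it appears.
    public static List<string> Split(string formula)
    {
        var parts = new List<string>();
        foreach (Match m in Token.Matches(formula))
        {
            parts.Add(m.Value);
        }
        return parts;
    }
}
```

Calling FormulaSplitter.Split("H2OCl4") yields H2, O, Cl4, and FormulaSplitter.Split("Na2H2O") yields Na2, H2, O.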

If you want to match 0, 1, or 2 lower-case letters, then you can write:

"[A-Z][a-z]{0,2}\d*"

Note, however, that both of these regular expressions assume that the input data is valid. Given bad data, they silently skip over any characters that do not fit the pattern. For example, if the input string is "H2ClxxzSO4", the second expression gives you the following (the first gives Cl instead of Clx):

  1. H2
  2. Clx
  3. S
  4. O4

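If you want to detect bad data, you'll need to check the Index property of each returned Match object to ensure that it is equal to the index where the previous match ended (0 for the first match), and that the last match ends at the end of the string.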
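One way to do that check, as a minimal sketch: track where the next match is expected to start and compare it with Match.Index. FormulaValidator and SplitStrict are hypothetical names, and throwing FormatException is just one possible way to report the problem.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; names and error handling are assumptions for the example.
static class FormulaValidator
{
    private static readonly Regex Token = new Regex(@"[A-Z][a-z]?\d*");

    // Returns the tokens, or throws if any character is not covered by a match.
    public static List<string> SplitStrict(string formula)
    {
        var parts = new List<string>();
        int expected = 0; // index where the next match must begin

        foreach (Match m in Token.Matches(formula))
        {
            // A gap between matches means the regex skipped over bad input.
            if (m.Index != expected)
                throw new FormatException($"Unexpected text at index {expected}.");

            parts.Add(m.Value);
            expected = m.Index + m.Length;
        }

        // Trailing characters after the last match are also bad input.
        if (expected != formula.Length)
            throw new FormatException($"Unexpected text at index {expected}.");

        return parts;
    }
}
```

With this, "H2OCl4" still splits into H2, O, Cl4, while "H2ClxxzSO4" throws because the match after Cl starts at index 7 rather than 4. Note that this only checks the shape of the input; it does not verify that each symbol is a real element.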

Upvotes: 11
