Rockbot

Reputation: 973

How to find chemical formulas with regex

This might not be a specific programming issue, but I am trying to find chemical formulas like H2O, CO2 etc. in a scientific text, and I use this:

(?<=[\l\u]|\.)\d+

This works, but it also matches the digits after the decimal point of every floating-point number:

0.1234 -> 1234 is selected.

Is there a way to prevent this? Thanks in advance!

Upvotes: 2

Views: 1598

Answers (2)

Qtax

Reputation: 33918

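If you also want to match whole formulas like H2O, CH3CH2CH2CH3 or SiO2, you could use:

(?i)\b[a-z]+(?:\d+[a-z]+)*\d*\b

or, case-sensitively, with each element symbol written as one uppercase letter plus an optional lowercase letter:

\b(?:[A-Z][a-z]?)+(?:\d+(?:[A-Z][a-z]?)+)*\d*\b

The trailing \d* lets a match end in a digit, as in SiO2. Both patterns check only the shape of a token, so the first also matches ordinary words.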
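To try the whole-formula approach, here is a minimal sketch in Python, assuming the case-sensitive pattern with an optional trailing \d* so that formulas ending in a digit also match. The helper name find_formulas and the sample sentence are made up for illustration.

```python
import re

# Case-sensitive formula shape: one or more element symbols (an uppercase
# letter plus an optional lowercase letter), optionally followed by
# count-plus-symbols groups. The optional trailing \d* lets a formula end
# in a digit, as in SiO2.
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?)+(?:\d+(?:[A-Z][a-z]?)+)*\d*\b")


def find_formulas(text):
    """Return every formula-shaped token in text (hypothetical helper)."""
    return FORMULA.findall(text)


sample = "water is H2O, butane is CH3CH2CH2CH3 and quartz is SiO2, at 0.1234 mol"
print(find_formulas(sample))  # ['H2O', 'CH3CH2CH2CH3', 'SiO2']
```

The pattern does not check for real element symbols, so short capitalized words such as "In" or "He" are matched too. Filtering the results against a list of element symbols would be one way to tighten this.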

Upvotes: 1

Bergi

Reputation: 665090

You might also include a negative lookbehind to prevent a preceding dot with a digit before it:

(?<=[\l\u.])(?<!\d\.)\d+
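The effect of the extra lookbehind can be checked with a short Python sketch. Python's re module has no \l or \u classes, so [A-Za-z] is assumed here as a stand-in. The names NAIVE and GUARDED and the sample string are made up for illustration.

```python
import re

# [A-Za-z] replaces the \l and \u classes, which Python's re does not support.
# NAIVE: digits preceded by a letter or a dot (the behaviour from the question).
NAIVE = re.compile(r"(?<=[A-Za-z.])\d+")
# GUARDED: the same, but rejected when the dot itself follows a digit.
GUARDED = re.compile(r"(?<=[A-Za-z.])(?<!\d\.)\d+")

sample = "H2O, CO2 and 0.1234"
print(NAIVE.findall(sample))    # ['2', '2', '1234']
print(GUARDED.findall(sample))  # ['2', '2']
```

A number written without a leading digit, such as .5, still matches, because there is no digit before the dot for the negative lookbehind to reject.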

Upvotes: 1
