Dawid

Reputation: 348

Regex pattern for matching numbers with units and special symbols (e.g. "120% g") and capturing them in separate groups

I would like to build a regex pattern that matches the following inputs:

11
2.5 
ca. 111g                   
ca. 120 g Case
11 Kilograms
12.5-125.0 g
ca. 120% g

In these cases I should always get 4 groups (using "ca. 12.5-125.0% g" as an example):

  1. ca. (everything that comes before the numbers)
  2. 12.5-125.0 (the number or number range)
  3. g (units)
  4. % (any special symbols after the number)

I have already built this regex, but it does not work as I want in all the situations above: (\d*[.]?[-]?\d+(?:\s*|\s+))(\w*)(\D). For example, the groups are not always built correctly: sometimes "g" lands in the third group and sometimes in the fourth.

Upvotes: 0

Views: 54

Answers (1)

The fourth bird

Reputation: 163477

The g can land in either the third or the fourth group because \D matches any character except a digit, so it can match the letters a-z just as \w can.

For example, in the string 1ga the g is in group 2. In the string 1g the g is in group 3, because the word characters are optional and \D requires at least one character.
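
The shifting group can be reproduced with a short Python check. This is only a sketch using the standard re module; the variable name question_pattern is made up for illustration, and the pattern itself is copied unchanged from the question.

```python
import re

# Pattern from the question, copied unchanged.
question_pattern = re.compile(r"(\d*[.]?[-]?\d+(?:\s*|\s+))(\w*)(\D)")

for text in ("1ga", "1g"):
    m = question_pattern.search(text)
    # \w* first grabs every letter, then has to give one back so that
    # the mandatory \D still has a character left to match.
    print(text, m.groups())

# 1ga ('1', 'g', 'a')
# 1g ('1', '', 'g')
```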

Note that this part of the pattern (?:\s*|\s+) can be written as \s*, because the first alternative already covers everything the second one can match. You can use \s in the pattern, but be aware that it can also match a newline.
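
Both remarks can be checked with a small Python sketch. The simplified patterns below (digits, optional whitespace, then g) are illustrative only and are not part of the question or the answer; the [ \t] class is just one possible way to exclude newlines, assuming only spaces and tabs should be allowed.

```python
import re

# (?:\s*|\s+) and \s* accept the same inputs.
verbose = re.compile(r"\d+(?:\s*|\s+)g")
simple = re.compile(r"\d+\s*g")
same = all(
    bool(verbose.fullmatch(s)) == bool(simple.fullmatch(s))
    for s in ("120g", "120 g", "120   g", "120")
)
print(same)  # True

# \s also matches a newline; a class of space and tab does not.
with_newline = bool(re.fullmatch(r"\d+\s*g", "120\ng"))
without_newline = bool(re.fullmatch(r"\d+[ \t]*g", "120\ng"))
print(with_newline, without_newline)  # True False
```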


One option could be to make the pattern a bit more specific and list the allowed special symbols in an optional character class, such as [%]?

^(?:(\w+\.) )?(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?)([%]?)(?: ?(\w+))?

The pattern matches

  • ^ Start of string
  • (?:(\w+\.) )? Optionally match capture group 1 (1+ word chars followed by a dot) and a trailing space
  • ( Capture group 2
    • \d+(?:\.\d+)? Match 1+ digits with an optional decimal part
    • (?:-\d+(?:\.\d+))? Optionally match - and 1+ digits followed by a decimal part (as written the decimal part is required here; use (?:\.\d+)? to make it optional)
  • ) Close group 2
  • ([%]?) Capture group 3, match an optional "special" char
  • (?: ?(\w+))? Optionally match an optional space followed by capture group 4, which matches 1+ word characters
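
As a rough check of the breakdown above, the anchored pattern can be run against the inputs from the question in Python. The helper name split_quantity is hypothetical. Note that this pattern puts the special symbol in group 3 and the unit in group 4, which is the reverse of the order listed in the question.

```python
import re

# Anchored pattern from the answer, copied unchanged.
PATTERN = re.compile(
    r"^(?:(\w+\.) )?(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?)([%]?)(?: ?(\w+))?"
)

def split_quantity(text):
    """Return (prefix, number, symbol, unit), or None if nothing matches.

    Hypothetical helper: a missing prefix or unit comes back as None,
    a missing symbol as an empty string.
    """
    m = PATTERN.match(text)
    return m.groups() if m else None

samples = [
    "11",
    "2.5 ",
    "ca. 111g",
    "ca. 120 g Case",
    "11 Kilograms",
    "12.5-125.0 g",
    "ca. 120% g",
    "ca. 12.5-125.0% g",
]
for sample in samples:
    print(repr(sample), "->", split_quantity(sample))
```

For "ca. 12.5-125.0% g" this prints ('ca.', '12.5-125.0', '%', 'g'), and for "11" it prints (None, '11', '', None).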


Without an anchor, you could use a word boundary \b instead, and if the dot after the leading word is not always there, you can make it optional with \.?

\b(?:(\w+\.?) )?(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?)([%]?)(?: ?(\w+))?
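
A minimal sketch of how this unanchored variant behaves with re.search in Python, where the quantity sits inside a longer string. The sample strings and the helper name find_quantity are invented for illustration.

```python
import re

# Word-boundary variant from the answer, copied unchanged.
UNANCHORED = re.compile(
    r"\b(?:(\w+\.?) )?(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?)([%]?)(?: ?(\w+))?"
)

def find_quantity(text):
    """Return the groups of the first match in text, or None (hypothetical helper)."""
    m = UNANCHORED.search(text)
    return m.groups() if m else None

print(find_quantity("Weight: ca. 120% g"))   # ('ca.', '120', '%', 'g')
print(find_quantity("approx 12.5-125.0 g"))  # ('approx', '12.5-125.0', '', 'g')
print(find_quantity("11 Kilograms"))         # (None, '11', '', 'Kilograms')
```

Because the dot is optional, any word that directly precedes the number and is separated from it by a single space ends up in group 1, as "approx" does here.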


Upvotes: 1
