Toothpick Anemone

Reputation: 4636

How can we write a regular expression (regex) to identify quantities with units, such as "54.20 grams"?

Below are some example test inputs.
Test inputs are ASCII-encoded strings.

TEST CASE INPUTS

arrhar = Array(100)
arrhar[1] = "Low Carb Orzo Low Carb Rice, High Protein, Great Low Carb Bread Company, Low Carb Pasta Rice, 7 g per pack"
arrhar[2] = "Helios Certified Organic Greek Orzo Pasta, 500gr"
arrhar[3] = "Barilla Orzo Pasta 15.73 oz."
arrhar[4] = "Pasta Granoro Il Primo Orzo 6 ounces per bag"
arrhar[5] = "Authentic Italian Orzo -- 6 OUNCE per bag"
arrhar[6] = "ORZO PASA 4 U! 1 BAGGY IZ 4.39-GRM"
arrhar.trim()   

TEST CASE OUTPUTS

out[1] = "7 g"    
out[2] = "500gr"     
out[3] = "15.73 oz"      
out[4] = "6 ounces"    
out[5] = "6 OUNCE"       
out[6] = "4.39-GRM"

English Description of Regular Expression

Suppose that we represent a string-matching pattern as a numbered list.
Item (1) describes the left-most part of the string.
Item (2) describes the sub-string second from the left.
Item (3) describes the third part of the string,
and so on...

  1. Numeric Quantity
    1. Zero or more digits (0, 1, 2, ...., 9)
    2. zero or one decimal points or commas
    3. Zero or more digits (0, 1, 2, ...., 9)
  2. Optional Delimiter
    1. Zero or more of any character except chars from the classes [A-Z], [a-z], and \d
  3. Unit
    1. Grams
      1. Any case-insensitive sub-sequence of "GRAMS"
        a. "g"
        b. "GRMS"
        c. "gs"
        d. "Gms"
        e. et cetera...
    2. Ounces
      1. Z-ounces ... any case-insensitive substring of OUNCEZ
      2. S-ounces ... any case-insensitive substring of OUNCES

Regex Pieces

Appropriate regular expressions for the left part (integer part) of a numeric quantity might be:

  • \d*
  • \d{0,}
  • [0-9]{0,}
  • [0123456789]*

A regex for zero or one decimal points is [\.,]?

A decimal number is \d*[\.,]\d

There might, or might not, be a delimiter between the number and the unit specification.

A suitable regexp for the delimiter might be [^a-zA-Z0-9]*

Suppose that we write a regex for the number and delimiter, but not the units (e.g. "ounces"). We might have:

\d*[\.,]?\d[^a-zA-Z0-9]*?

I hope that the above would match "4.91...." or "4.91 "
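
As a quick check of that hope, here is a minimal sketch using Python's `re` module (Python is an assumption; the question does not name a language). It suggests that the pattern as written stops after the first digit that follows the decimal point, because the trailing `\d` matches exactly one digit and the lazy `*?` delimiter is free to match nothing. The `variant` pattern is a hypothetical alternative for comparison, not part of the original attempt.

```python
import re

# The number-plus-delimiter pattern exactly as written above.
original = re.compile(r"\d*[\.,]?\d[^a-zA-Z0-9]*?")

# Hypothetical variant (an assumption, not from the question):
# one or more digits, an optional fractional part, then a greedy delimiter.
variant = re.compile(r"\d+(?:[.,]\d+)?[^a-zA-Z0-9]*")

samples = ["4.91....", "4.91 "]

original_hits = [original.search(s).group() for s in samples]
variant_hits = [variant.search(s).group() for s in samples]

print(original_hits)  # ['4.9', '4.9']
print(variant_hits)   # ['4.91....', '4.91 ']
```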

A regex for sub-sequences of "GRAMS" might be: [Gg]?[Rr]?[Aa]?[Mm]?[Ss]?

A regex which captures something like "4.1-grm" is shown below:

\d*[\.,]?\d[^a-zA-Z0-9]*?[Gg]?[Rr]?[Aa]?[Mm]?[Ss]?

How can we get both grams and ounces?

Upvotes: 4

Views: 907

Answers (2)

Ryszard Czech

Reputation: 18631

Use

/\d[.,\d]*\W*(?:gr?a?m?s?|ou?n?c?e?[zs]?)/i

See proof.

Explanation

--------------------------------------------------------------------------------
  \d                       a single digit (0-9)
--------------------------------------------------------------------------------
  [.,\d]*                  any character of: '.', ',', digits (0-9)
                           (0 or more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \W*                      non-word characters (all but a-z, A-Z, 0-
                           9, _) (0 or more times (matching the most
                           amount possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    g                        'g'
--------------------------------------------------------------------------------
    r?                       'r' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    a?                       'a' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    m?                       'm' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    s?                       's' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    o                        'o'
--------------------------------------------------------------------------------
    u?                       'u' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    n?                       'n' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    c?                       'c' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    e?                       'e' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    [zs]?                    any character of: 'z', 's' (optional
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of grouping
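
To try the pattern against the test inputs from the question, a minimal sketch in Python might look like the following. Python and the helper name `first_match` are assumptions made for illustration; the `/i` flag is expressed as `re.IGNORECASE`.

```python
import re

# The pattern from this answer; /i becomes re.IGNORECASE in Python.
pattern = re.compile(r"\d[.,\d]*\W*(?:gr?a?m?s?|ou?n?c?e?[zs]?)", re.IGNORECASE)

inputs = [
    "Low Carb Orzo Low Carb Rice, High Protein, Great Low Carb Bread Company, Low Carb Pasta Rice, 7 g per pack",
    "Helios Certified Organic Greek Orzo Pasta, 500gr",
    "Barilla Orzo Pasta 15.73 oz.",
    "Pasta Granoro Il Primo Orzo 6 ounces per bag",
    "Authentic Italian Orzo -- 6 OUNCE per bag",
    "ORZO PASA 4 U! 1 BAGGY IZ 4.39-GRM",
]


def first_match(text):
    """Return the first matched substring, or None (hypothetical helper)."""
    m = pattern.search(text)
    return m.group() if m else None


results = [first_match(s) for s in inputs]
print(results)
# ['7 g', '500gr', '15.73 oz', '6 ounces', '6 OUNCE', '4.39-GRM']

# Without word boundaries, a number followed by any word that starts
# with "g" or "o" also matches.
stray = first_match("4 oranges")
print(stray)  # '4 o'
```

The last line shows one trade-off of this pattern: because only the leading `g` or `o` is required and no word boundary is enforced, unrelated words can produce a match.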

Upvotes: 0

The fourth bird

Reputation: 163477

Using a ? to make all the parts optional in [Gg]?[Rr]?[Aa]?[Mm]?[Ss]? could possibly also match RM or an empty string.
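
A small sketch can illustrate this point. It assumes Python's `re` module (the question does not specify a language) and uses the patterns exactly as they appear in the question.

```python
import re

# Unit pattern from the question: every letter is optional.
unit = re.compile(r"[Gg]?[Rr]?[Aa]?[Mm]?[Ss]?")

empty_ok = unit.fullmatch("") is not None    # the empty string matches
rm_ok = unit.fullmatch("RM") is not None     # so does "RM"
print(empty_ok, rm_ok)  # True True

# Full pattern from the question: since the unit may be empty,
# a bare number with no unit after it still produces a match.
full = re.compile(r"\d*[\.,]?\d[^a-zA-Z0-9]*?[Gg]?[Rr]?[Aa]?[Mm]?[Ss]?")
bare = full.search("4 U! 1 BAGGY").group()
print(repr(bare))  # '4'
```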

You might use a case-insensitive match with an alternation | to list the possible alternatives, making them a bit more specific.

\b\d+(?:[.,]\d+)?\s*(?:gr?|oz|ounces?|-grm|grams?)\b
  • \b A word boundary
  • \d+ Match 1+ digits
  • (?:[.,]\d+)? Optionally match either . or , and 1+ digits
  • \s* Match 0+ whitespace chars
  • (?:gr?|oz|ounces?|-grm|grams?) Match one of the alternatives
  • \b A word boundary

Regex demo
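
As a rough check, the pattern can be run over the inputs from the question. This is a sketch in Python (an assumption, as is the helper name `extract`), with case-insensitivity supplied through `re.IGNORECASE`.

```python
import re

# Alternation-based pattern from above, matched case-insensitively.
pattern = re.compile(
    r"\b\d+(?:[.,]\d+)?\s*(?:gr?|oz|ounces?|-grm|grams?)\b",
    re.IGNORECASE,
)

inputs = [
    "Low Carb Orzo Low Carb Rice, High Protein, Great Low Carb Bread Company, Low Carb Pasta Rice, 7 g per pack",
    "Helios Certified Organic Greek Orzo Pasta, 500gr",
    "Barilla Orzo Pasta 15.73 oz.",
    "Pasta Granoro Il Primo Orzo 6 ounces per bag",
    "Authentic Italian Orzo -- 6 OUNCE per bag",
    "ORZO PASA 4 U! 1 BAGGY IZ 4.39-GRM",
]


def extract(text):
    """Return the first quantity-with-unit substring, or None (hypothetical helper)."""
    m = pattern.search(text)
    return m.group() if m else None


results = [extract(s) for s in inputs]
print(results)
# ['7 g', '500gr', '15.73 oz', '6 ounces', '6 OUNCE', '4.39-GRM']

# The trailing \b keeps ordinary words from being read as units.
print(extract("4 oranges"), extract("12 grapes"))  # None None
```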

Another option, for example, is to nest non-capturing groups to make selected parts optional, but in a certain order:

\b\d+(?:[.,]\d+)?\s*-?(?:g(?:r(?:a?ms?)?)?|oz|ounces?)\b

Regex demo
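
To see which abbreviations the nested groups accept, a short sketch (again assuming Python; the candidate strings are made up for illustration) can sort a few spellings into accepted and rejected:

```python
import re

# Nested-group pattern from above, matched case-insensitively.
pattern = re.compile(
    r"\b\d+(?:[.,]\d+)?\s*-?(?:g(?:r(?:a?ms?)?)?|oz|ounces?)\b",
    re.IGNORECASE,
)

# Illustrative candidates, not taken from the question.
candidates = [
    "5 g", "5 gr", "5 grm", "5 grms", "5 gram", "5 grams",
    "5-GRM", "5 gms", "5 rm", "5 ounce", "5 oz",
]

accepted = [c for c in candidates if pattern.fullmatch(c)]
rejected = [c for c in candidates if not pattern.fullmatch(c)]

print(accepted)
# ['5 g', '5 gr', '5 grm', '5 grms', '5 gram', '5 grams', '5-GRM', '5 ounce', '5 oz']
print(rejected)
# ['5 gms', '5 rm']
```

Because each inner group is only reachable after the letter before it, spellings that skip the leading `g` or the `r` (such as "rm" or "gms") are rejected, unlike with the all-optional character classes.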

Upvotes: 3
