Reputation: 4636
Below are some example test inputs.
Test inputs are ASCII-encoded strings.
arrhar = Array(100)
arrhar[1] = "Low Carb Orzo Low Carb Rice, High Protein, Great Low Carb Bread Company, Low Carb Pasta Rice, 7 g per pack"
arrhar[2] = "Helios Certified Organic Greek Orzo Pasta, 500gr"
arrhar[3] = "Barilla Orzo Pasta 15.73 oz."
arrhar[4] = "Pasta Granoro Il Primo Orzo 6 ounces per bag"
arrhar[5] = "Authentic Italian Orzo -- 6 OUNCE per bag"
arrhar[6] = "ORZO PASA 4 U! 1 BAGGY IZ 4.39-GRM"
arrhar.trim()
out[1] = "7 g"
out[2] = "500gr"
out[3] = "15.73 oz"
out[4] = "6 ounces"
out[5] = "6 OUNCE"
out[6] = "4.39-GRM"
Suppose that we represent a string-matching pattern as a bulleted list.
bullet (1) describes the left-most part of the string.
bullet (2) describes the sub-string second from the left.
bullet (3) describes the third part of the string
and so on...
[A-Z], [a-z], and \d
OUNCEZ
OUNCES
Appropriate regular expressions for the left part (integer part) of a numeric quantity might be:
\d*
\d{0,}
[0-9]{0,}
[0123456789]*
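The four spellings above should be interchangeable on ASCII input. A minimal Python sketch to check that, assuming Python's `re` module (the names `INTEGER_PATTERNS` and `integer_part` are illustrative; `re.ASCII` is passed because Python's `\d` would otherwise also accept non-ASCII digits):

```python
import re

# The four integer-part patterns listed above.
INTEGER_PATTERNS = [r"\d*", r"\d{0,}", r"[0-9]{0,}", r"[0123456789]*"]

def integer_part(pattern, text):
    """Return what `pattern` matches at the very start of `text`."""
    return re.match(pattern, text, flags=re.ASCII).group()

# Apply every pattern to the same sample and compare the results.
results = {p: integer_part(p, "500gr") for p in INTEGER_PATTERNS}
for pattern, matched in results.items():
    print(f"{pattern!r:20} -> {matched!r}")
```

Each of these patterns also matches the empty string, so "no integer part at all" is accepted as well.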
A regex for zero or one decimal points is [\.,]?
A decimal number is \d*[\.,]\d
There might, or might not, be a delimiter between the number and the unit-specification.
56.1gr
56.1 gr
56.1-grams
A suitable regexp for the delimiter might be
[^a-zA-Z0-9]*
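As a quick check of the delimiter pattern, here is a small Python sketch that applies it to whatever follows the number in the three examples above. Slicing off the known prefix "56.1" is only a shortcut for this illustration, and the names are hypothetical:

```python
import re

# Delimiter pattern from above: zero or more non-alphanumeric characters.
DELIMITER = re.compile(r"[^a-zA-Z0-9]*")

EXAMPLES = ["56.1gr", "56.1 gr", "56.1-grams"]

delimiters = []
for text in EXAMPLES:
    rest = text[len("56.1"):]                        # text after the number
    delimiters.append(DELIMITER.match(rest).group())  # leading delimiter, maybe empty

print(delimiters)  # ['', ' ', '-']
```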
Suppose that we write a regex for the number and delimiter, but not the units (e.g. "ounces"). We might have:
\d*[\.,]?\d[^a-zA-Z0-9]*?
I hope that the above would match "4.91...." or "4.91 "
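In Python, at least, the pattern stops short of that: only one `\d` follows the optional decimal point, and the lazy `*?` is free to match nothing. The sketch below shows this, together with one possible adjustment (`\d+` and a greedy `*`), which is an assumption of this example rather than part of the pattern above:

```python
import re

# Pattern exactly as written above.
ORIGINAL = re.compile(r"\d*[\.,]?\d[^a-zA-Z0-9]*?")
# Hypothetical adjustment: one or more trailing digits, greedy delimiter.
ADJUSTED = re.compile(r"\d*[\.,]?\d+[^a-zA-Z0-9]*")

SAMPLES = ["4.91....", "4.91 "]

original_matches = [ORIGINAL.search(s).group() for s in SAMPLES]
adjusted_matches = [ADJUSTED.search(s).group() for s in SAMPLES]

print(original_matches)  # ['4.9', '4.9']
print(adjusted_matches)  # ['4.91....', '4.91 ']
```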
A regex for sub-sequences of "GRAMS" might be:
[Gg]?[Rr]?[Aa]?[Mm]?[Ss]?
A regex which captures something like "4.1-grm" is shown below:
\d*[\.,]?\d[^a-zA-Z0-9]*?[Gg]?[Rr]?[Aa]?[Mm]?[Ss]?
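Running this combined pattern over the sixth test input in Python suggests it never reaches the unit. Every unit letter is optional and the delimiter is lazy, so each match ends as soon as the required digit has been consumed. A small sketch (the variable names are illustrative):

```python
import re

# Combined number + delimiter + "grams" pattern, exactly as written above.
COMBINED = re.compile(r"\d*[\.,]?\d[^a-zA-Z0-9]*?[Gg]?[Rr]?[Aa]?[Mm]?[Ss]?")

title = "ORZO PASA 4 U! 1 BAGGY IZ 4.39-GRM"

# findall returns every non-overlapping match, left to right.
all_matches = COMBINED.findall(title)
print(all_matches)  # ['4', '1', '4.3', '9']
```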
How can we get both grams and ounces?
Upvotes: 4
Views: 907
Reputation: 18631
Use
/\d[.,\d]*\W*(?:gr?a?m?s?|ou?n?c?e?[zs]?)/i
See proof.
Explanation
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
[.,\d]* any character of: '.', ',', digits (0-9)
(0 or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\W* non-word characters (all but a-z, A-Z, 0-9, _)
(0 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
g 'g'
--------------------------------------------------------------------------------
r? 'r' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
a? 'a' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
m? 'm' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
s? 's' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
o 'o'
--------------------------------------------------------------------------------
u? 'u' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
n? 'n' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
c? 'c' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
e? 'e' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
[zs]? any character of: 'z', 's' (optional
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of grouping
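To try the pattern against the test inputs from the question, a minimal Python sketch follows. It assumes `re.IGNORECASE` as the equivalent of the `/i` flag; the names `QUANTITY`, `TITLES`, and `extract` are made up for this example:

```python
import re

# Pattern from this answer; re.IGNORECASE stands in for the /i flag.
QUANTITY = re.compile(r"\d[.,\d]*\W*(?:gr?a?m?s?|ou?n?c?e?[zs]?)", re.IGNORECASE)

TITLES = [
    "Low Carb Orzo Low Carb Rice, High Protein, Great Low Carb Bread Company, "
    "Low Carb Pasta Rice, 7 g per pack",
    "Helios Certified Organic Greek Orzo Pasta, 500gr",
    "Barilla Orzo Pasta 15.73 oz.",
    "Pasta Granoro Il Primo Orzo 6 ounces per bag",
    "Authentic Italian Orzo -- 6 OUNCE per bag",
    "ORZO PASA 4 U! 1 BAGGY IZ 4.39-GRM",
]

def extract(title):
    """Return the first quantity-with-unit found in `title`, or None."""
    match = QUANTITY.search(title)
    return match.group() if match else None

extracted = [extract(t) for t in TITLES]
print(extracted)
# ['7 g', '500gr', '15.73 oz', '6 ounces', '6 OUNCE', '4.39-GRM']
```

Because every letter after the first is optional and there is no trailing word boundary, the pattern is quite permissive: `extract("2 good")` returns "2 g".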
Upvotes: 0
Reputation: 163477
Using a ? to make all the parts optional in [Gg]?[Rr]?[Aa]?[Mm]?[Ss]? could possibly also match RM or an empty string.
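This is easy to confirm with a full-string match; a short Python sketch (the names are illustrative):

```python
import re

# The all-optional "grams" pattern from the question.
GRAMS_LOOSE = re.compile(r"[Gg]?[Rr]?[Aa]?[Mm]?[Ss]?")

CANDIDATES = ["grams", "grm", "RM", ""]

# fullmatch succeeds only if the whole candidate string is matched.
accepted = [s for s in CANDIDATES if GRAMS_LOOSE.fullmatch(s)]
print(accepted)  # ['grams', 'grm', 'RM', '']
```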
You might use a case-insensitive match with an alternation | to list the possible alternatives, making them a bit more specific.
\b\d+(?:[.,]\d+)?\s*(?:gr?|oz|ounces?|-grm|grams?)\b
\b  A word boundary
\d+  Match 1+ digits
(?:[.,]\d+)?  Optionally match either . or , and 1+ digits
\s*  Match 0+ whitespace chars
(?:gr?|oz|ounces?|-grm|grams?)  Match one of the alternatives
\b  A word boundary

Another option, for example, is to nest non-capturing groups to make selected parts optional, but in a certain order:
\b\d+(?:[.,]\d+)?\s*-?(?:g(?:r(?:a?ms?)?)?|oz|ounces?)\b
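Both patterns can be checked against the test inputs from the question. The Python sketch below assumes `re.IGNORECASE` for the case-insensitive match; `FLAT`, `NESTED`, `TITLES`, and `first_match` are names invented for this example:

```python
import re

# The two patterns from this answer, compiled case-insensitively.
FLAT = re.compile(r"\b\d+(?:[.,]\d+)?\s*(?:gr?|oz|ounces?|-grm|grams?)\b", re.IGNORECASE)
NESTED = re.compile(r"\b\d+(?:[.,]\d+)?\s*-?(?:g(?:r(?:a?ms?)?)?|oz|ounces?)\b", re.IGNORECASE)

TITLES = [
    "Low Carb Orzo Low Carb Rice, High Protein, Great Low Carb Bread Company, "
    "Low Carb Pasta Rice, 7 g per pack",
    "Helios Certified Organic Greek Orzo Pasta, 500gr",
    "Barilla Orzo Pasta 15.73 oz.",
    "Pasta Granoro Il Primo Orzo 6 ounces per bag",
    "Authentic Italian Orzo -- 6 OUNCE per bag",
    "ORZO PASA 4 U! 1 BAGGY IZ 4.39-GRM",
]

def first_match(pattern, title):
    """Return the first match of `pattern` in `title`, or None."""
    match = pattern.search(title)
    return match.group() if match else None

flat_results = [first_match(FLAT, t) for t in TITLES]
nested_results = [first_match(NESTED, t) for t in TITLES]
print(flat_results)
print(nested_results)
# Both print: ['7 g', '500gr', '15.73 oz', '6 ounces', '6 OUNCE', '4.39-GRM']
```

Thanks to the trailing \b and the stricter unit alternatives, inputs such as "2 good" or "1 RM" produce no match with either pattern.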
Upvotes: 3