ArjunLotay

Reputation: 119

Parsing number information from an ingredient string using regex

I am trying to extract the quantity information from an ingredient string where the unit has already been removed.

175 risotto rice
a little hot vegetable stock (optional)
1     coriander
salt pepper
1 0.5   extra virgin olive oil
1  mild onion
300 split red lentils
1.7   well-flavoured vegetable stock
4  carrots
1 head celery
100 stilton cheese
4   snipped  chives
salt pepper
225 dried flageolet beans

These are examples of the strings I am parsing, and the results should look like:

175

1

1 0.5
1
300
1.7
4
1
100
4

225

My current thinking is to use [0-9]+[ ]*[0-9]*.?[0-9]* as the regex. However, this also picks up the first character after the numerical values. For example, 175 risotto rice returns "175 r".

Upvotes: 0

Views: 209

Answers (2)

The fourth bird

Reputation: 163477

In your regex you match .? which matches an optional character (any character except a newline). In your data, that character is for example the r in risotto or the c in coriander.
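To make the effect of the unescaped dot concrete, here is a minimal sketch that runs the pattern from the question next to the same pattern with the dot escaped. The class name DotEscapeDemo and the helper firstMatch are made up for this illustration; only the standard java.util.regex API is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Pattern from the question: the '.' is unescaped, so ".?" matches any one character.
    static final Pattern UNESCAPED = Pattern.compile("[0-9]+[ ]*[0-9]*.?[0-9]*");
    // Same pattern with the dot escaped: "\\.?" matches only an optional literal dot.
    static final Pattern ESCAPED = Pattern.compile("[0-9]+[ ]*[0-9]*\\.?[0-9]*");

    // Returns the first match of the pattern in the input, or "" if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String input = "175 risotto rice";
        // Brackets make trailing whitespace visible in the output.
        System.out.println("[" + firstMatch(UNESCAPED, input) + "]"); // [175 r]
        System.out.println("[" + firstMatch(ESCAPED, input) + "]");   // [175 ]
    }
}
```

Note that the escaped version still includes the space after 175, because [ ]* consumes it, so the result would need a trim() if that matters.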

You could use an anchor to assert the start of the string and match 1+ digits followed by an optional part that matches a dot and 1+ digits.

After that match you could add the same optional pattern with a leading 1+ spaces or tabs:

^\d+(?:\.\d+)?(?:[ \t]+\d+(?:\.\d+)?)?

In Java

String regex = "^\\d+(?:\\.\\d+)?(?:[ \\t]+\\d+(?:\\.\\d+)?)?";

That will match

  • ^ Start of the string
  • \d+(?:\.\d+)? Match 1+ digits followed by an optional part ? that matches a dot and 1+ digits
  • (?: Non capturing group
    • [ \t]+\d+(?:\.\d+)? Match 1+ times a space or tab, then 1+ digits, again followed by an optional part that matches a dot and 1+ digits
  • )? Close non capturing group and make it optional

Note that if you want to match the second pattern 0+ times instead of making it optional, you could use * instead of ? after the non capturing group.
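As a rough sketch of how the anchored pattern could be applied to the strings from the question: the class QuantityExtractor and the method extract are hypothetical names chosen for this example, and the pattern assumes the fractional part of the second number is optional, as the breakdown above describes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantityExtractor {
    // A number at the start of the string, optionally followed by
    // spaces/tabs and a second number. Each fractional part is optional.
    static final Pattern QUANTITY =
        Pattern.compile("^\\d+(?:\\.\\d+)?(?:[ \\t]+\\d+(?:\\.\\d+)?)?");

    // Returns the leading quantity, or "" when the string does not start with a digit.
    static String extract(String ingredient) {
        Matcher m = QUANTITY.matcher(ingredient);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String[] samples = {
            "175 risotto rice",
            "a little hot vegetable stock (optional)",
            "1 0.5   extra virgin olive oil",
            "1.7   well-flavoured vegetable stock",
            "salt pepper"
        };
        for (String s : samples) {
            // Brackets make empty results visible.
            System.out.println("[" + extract(s) + "]");
        }
    }
}
```

Because whitespace is only consumed when a second number follows it, the match carries no trailing spaces.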


Upvotes: 0

Vogel612

Reputation: 5647

The problem here is that you are not escaping the . in .?, so it matches any character instead of a literal dot; it should be \.?. The exact behaviour is still somewhat unclear to me, but using your pattern with the . escaped should already provide you with the desired matching behavior.

Note that you can shorten [0-9] into \d:

^\d+\s*\d*\.?\d*

If you want to access each number separately, you need capture groups around each number.
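One possible way to do that in Java is sketched below. This is not the exact pattern above but a variant written for illustration: each number gets its own capture group, and the second number is wrapped in an optional non-capturing group. The names QuantityGroups and numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantityGroups {
    // Group 1: first number. Group 2: optional second number after whitespace.
    static final Pattern QUANTITY =
        Pattern.compile("^(\\d+(?:\\.\\d+)?)(?:\\s+(\\d+(?:\\.\\d+)?))?");

    // Returns the numbers found at the start of the string (zero, one, or two entries).
    static List<String> numbers(String ingredient) {
        List<String> result = new ArrayList<>();
        Matcher m = QUANTITY.matcher(ingredient);
        if (m.find()) {
            result.add(m.group(1));
            // group(2) is null when the optional second number did not participate.
            if (m.group(2) != null) {
                result.add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(numbers("1 0.5   extra virgin olive oil")); // [1, 0.5]
        System.out.println(numbers("175 risotto rice"));               // [175]
        System.out.println(numbers("salt pepper"));                    // []
    }
}
```

Each entry can then be passed to Double.parseDouble if numeric values are needed rather than strings.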

Upvotes: 1
