Reputation: 119
I am trying to extract the quantity information from an ingredient string where the unit has already been removed.
175 risotto rice
a little hot vegetable stock (optional)
1 coriander
salt pepper
1 0.5 extra virgin olive oil
1 mild onion
300 split red lentils
1.7 well-flavoured vegetable stock
4 carrots
1 head celery
100 stilton cheese
4 snipped chives
salt pepper
225 dried flageolet beans
These are examples of the strings I am parsing, and the results should look like:
175
1
1 0.5
1
300
1.7
4
1
100
4
225
My current thinking is to use [0-9]+[ ]*[0-9]*.?[0-9]* as the regex. However, this also picks up the first character after the numerical values: for example, 175 risotto rice returns "175 r".
Upvotes: 0
Views: 209
Reputation: 163477
In your regex, .? matches an optional character (any character except a newline), which in your data is, for example, the r in risotto or the c in coriander.
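To make the effect concrete, here is a minimal sketch that runs the pattern from the question with and without the dot escaped. The class name DotDemo and the helper firstMatch are made up for illustration; only java.util.regex is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String original = "[0-9]+[ ]*[0-9]*.?[0-9]*";   // unescaped dot: matches any character
        String escaped  = "[0-9]+[ ]*[0-9]*\\.?[0-9]*"; // escaped dot: matches a literal '.'

        // Brackets make a trailing space visible in the output.
        System.out.println("[" + firstMatch(original, "175 risotto rice") + "]"); // [175 r]
        System.out.println("[" + firstMatch(escaped, "175 risotto rice") + "]");  // [175 ]
    }
}
```

Even with the dot escaped, the match can still end in a space, because [ ]* is allowed to match without a second number following it.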
You could use an anchor to assert the start of the string and match 1+ digits followed by an optional part that matches a dot and 1+ digits.
After that match you could add the same optional pattern with a leading 1+ spaces or tabs:
^\d+(?:\.\d+)?(?:[ \t]+\d+(?:\.\d+)?)?
In Java
String regex = "^\\d+(?:\\.\\d+)?(?:[ \\t]+\\d+(?:\\.\\d+)?)?";
That will match
^ : Start of the string
\d+(?:\.\d+)? : Match 1+ digits followed by an optional part that matches a dot and 1+ digits
(?: : Non-capturing group
[ \t]+\d+(?:\.\d+)? : Match 1+ spaces or tabs, then 1+ digits, again followed by an optional part that matches a dot and 1+ digits
)? : Close the non-capturing group and make it optional
Note that if you want to match the second pattern 0+ times instead of making it optional, you could use * instead of ?.
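As a rough sketch of how the anchored pattern could be applied to the sample strings from the question: the class name QuantityExtractor and the method extract are hypothetical, and the pattern used here keeps the decimal part of the second number optional, as the description above states.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantityExtractor {
    // One number, optionally followed by whitespace and a second number;
    // the decimal part of each number is optional.
    private static final Pattern QUANTITY =
        Pattern.compile("^\\d+(?:\\.\\d+)?(?:[ \\t]+\\d+(?:\\.\\d+)?)?");

    // Returns the leading quantity, or null when the string does not start with a number.
    static String extract(String ingredient) {
        Matcher m = QUANTITY.matcher(ingredient);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "175 risotto rice",
            "a little hot vegetable stock (optional)",
            "1 0.5 extra virgin olive oil",
            "1.7 well-flavoured vegetable stock",
            "salt pepper"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + extract(s));
        }
    }
}
```

Strings without a leading number, such as "salt pepper", yield null here, so the caller has to decide whether to skip them or treat them as having no quantity.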
Upvotes: 0
Reputation: 5647
The problem here is that you are not escaping the . in .? to make it a literal \.?. Unescaped, the . matches any character, which is why the letter after the number is picked up. Escaping the . in your pattern should already give you the desired matching behavior, although the match may include a trailing space that you can trim.
Note that you can shorten [0-9] to \d:
^\d+\s*\d*\.?\d*
If you want to access each number separately, you need capture groups around each number.
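One possible way to do that, as a sketch only: the pattern below is a stricter variant than the one above (it requires whitespace before the second number and digits after a dot), and the class name QuantityGroups and method numbers are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantityGroups {
    // Group 1: first number. Group 2: optional second number.
    private static final Pattern QUANTITY =
        Pattern.compile("^(\\d+(?:\\.\\d+)?)(?:[ \\t]+(\\d+(?:\\.\\d+)?))?");

    // Returns the leading numbers as separate strings; empty if there are none.
    static List<String> numbers(String ingredient) {
        List<String> result = new ArrayList<>();
        Matcher m = QUANTITY.matcher(ingredient);
        if (m.find()) {
            result.add(m.group(1));
            if (m.group(2) != null) { // group(2) is null when no second number was matched
                result.add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(numbers("1 0.5 extra virgin olive oil")); // [1, 0.5]
        System.out.println(numbers("175 risotto rice"));             // [175]
        System.out.println(numbers("salt pepper"));                  // []
    }
}
```

Because the whitespace sits outside the capture groups, each captured number comes back without surrounding spaces, so no trimming is needed.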
Upvotes: 1