Reputation: 5818

Regex for custom decimal and thousand separator

I am using the regex below to handle a custom thousand separator, which can be a comma, a period, or a space. It works for the thousand separator but not for the decimal indicator.

I am trying to add a new capturing group to handle the decimal indicator (, or .) with a maximum of 2 decimals, but once I add it the regex breaks for the thousand separator.

^[+]?(?:\d{1,3}(?:(,|.| )\d{3})*|\d+)?,?$

How can I add a capturing group to handle the decimal part with a custom character? Any ideas?

Valid Inputs:

1234
123.45
123,45
1234.56
1234,56
123
1,234
12,345
1,234,567
12,345,678
123,456,789

12
1.234
12.345
1.234.567
12.345.678
123.456.789

123
1 234
12 345
123 456
1 234 567
12 345 678
123 456 789

123.4567
123,4567

1,345.67
1.345,67
1 345.67

12,345.67
12.345,67
12 345.67
123,456,789.34
123.456.789,34
123 456 789.34

Not Valid:

12.345.67
12,345,67
12 345 67
123 456 789 34

Upvotes: 0

Answers (4)

Luis Colorado

Reputation: 12645

Well, your specification is ambiguous: if ',' is accepted as the decimal indicator, should 123,456 be parsed as the number 123456 or as the number 123.456 (one thousandth of it)? You can resolve the ambiguity by disallowing numbers with exactly three decimals, but at a high cost: the user has to understand that typing three decimals by mistake gives weird results (123,456 will be parsed as 123456.0, while 123,4560 will be parsed as 123.456). This is hard for a user to accept. It is more sensible to use the rule that a single , or . means a decimal point, while if both indicators are present, the first is the group separator and the second is the decimal point.
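The rule in the last sentence can be sketched in Python. This is only an illustration, not part of the lex solution below: parse_number is a hypothetical helper, and it assumes that spaces are only ever group separators and that a symbol repeated more than once must be a group separator. It does not check that groups have exactly three digits.

```python
from decimal import Decimal


def parse_number(text: str) -> Decimal:
    """Hypothetical helper: a lone ',' or '.' is the decimal point;
    when both occur, the one that comes last is the decimal point."""
    s = text.strip().replace(" ", "")   # assumption: spaces only ever group digits
    last = max(s.rfind(","), s.rfind("."))
    if last == -1:                      # no separator at all: plain integer
        return Decimal(s)
    dec = s[last]                       # candidate decimal point: the last separator
    grp = "," if dec == "." else "."    # the other symbol can only group digits
    int_part, frac_part = s[:last], s[last + 1:]
    if dec in int_part:
        # The same symbol occurs more than once, so it cannot be a decimal point.
        if grp in int_part:
            raise ValueError(f"inconsistent separators: {text!r}")
        # Assumption: a repeated symbol is a group separator.
        return Decimal(s.replace(dec, ""))
    return Decimal(int_part.replace(grp, "") + "." + frac_part)


print(parse_number("123,456"))     # 123.456
print(parse_number("1.234,56"))    # 1234.56
print(parse_number("1,234,567"))   # 1234567
```

Under this rule 123,456 always reads as 123.456, so the three-decimal special case disappears.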

IMHO I would never use the space as a decimal indicator, because nobody uses it that way (if you use it as a group separator, make it the only digit group separator; some programming languages, e.g. Java, allow _ as a digit group separator). It's preferable to use no decimal indicator at all (making the number an integer scaled 10, 100, or 1000 times; desktop calculators have done this for a long time), because people doing quick data entry prefer to key the extra zeros rather than move a finger to locate the decimal point and then type two more digits most of the time. It gets even worse if they have to go to the letter keys to find the space bar. (Of course, it is harder still to go there for the underscore _ character, but fast typists don't use group separators.)

On the other hand, people normally don't key the thousands separators; they exist just for readability (computers add them when printing, but never expect them when reading). In this scenario, users sometimes don't want the rigid rule of three-digit groups, but want to group digits arbitrarily. This leads to situations where the user wants to separate digits in groups of three to the left of the decimal point, while using groups of five or ten to the right (which is something you don't contemplate at all), making, e.g., PI appear as:

3.14159 26535 89793 23846 26433 83

I agree that using the alternate decimal point as a group separator could be interesting, but on both sides of the actual decimal point, and never forcing groups of three.

Anyway, just to fit your specs, I've written the following lex(1) specification to parse your input.

%{
#include <stdio.h>   /* printf */
%}
pfx     [1-9][0-9]?[0-9]?
grp     [0-9][0-9][0-9]
dec     [0-9]*

e1      [+-]?{pfx}([.]{grp})*([,]{dec})?
e2      [+-]?{pfx}([,]{grp})*([.]{dec})?
e3      [+-]?{pfx}([ ]{grp})*([.,]{dec})?
e4      [+-]?[1-9][0-9]*([,.]{dec})?
e5      [+-]?0?([,.]{dec})?
%%
{e1}|{e2}|{e3}|{e4}|{e5}            printf("\033[32m[%s]\033[m\n", yytext);
[0-9., +-]*                         printf("\033[31m[%s]\033[m\n", yytext);
.                                   |
\n                                  |
\t                                  ;
%%
int main()
{
    yylex();
}

int yywrap()
{
    return 1;
}

Your complete regular expression should be something like:

[+-]?[0-9]{1,3}([ ][0-9]{3})*([,.]([0-9]{3}[ ])*[0-9]{1,3})?|[+-]?[0-9]{1,3}([ ][0-9]{3})*([,.][0-9]{0,2})?|[+-]?[0-9]{0,2}[,.]([0-9]{3}[ ])*[0-9]{1,3}|[+-]?[0-9]{1,3}([,][0-9]{3})*([.]([0-9]{3}[,])*[0-9]{1,3})?|[+-]?[0-9]{1,3}([,][0-9]{3})*([.][0-9]{0,2})?|[+-]?[0-9]{0,2}[.]([0-9]{3}[,])*[0-9]{1,3}|[+-]?[0-9]{1,3}([.][0-9]{3})*([,]([0-9]{3}[.])*[0-9]{1,3})?|[+-]?[0-9]{1,3}([.][0-9]{3})*([,][0-9]{0,2})?|[+-]?[0-9]{0,2}[,]([0-9]{3}[.])*[0-9]{1,3}|[+-]?[0-9]*[,.][0-9]+|[+-]?[0-9]+[,.][0-9]*|[+-]?[0-9]+

Note

Some regexp libraries do not treat the | operator as commutative, as it should be. POSIX tools such as sed(1) return the longest match whichever way the alternatives are ordered, but backtracking engines (the worst case I know is regex101.com, see below) take the first alternative that matches, which forces you to put the operands in a particular order to match some strings. I consider this a bug in the library, but unfortunately it is widespread. Below is the above expression (which works fine with sed(1)), and you'll see how it doesn't match correctly on regex101 (there should be far fewer matches).

I've also written a bash script (shown below) that uses sed(1) with the above regexp, so you can see how it works on your side:

dig="[0-9]"

af0="${dig}{0,2}"
af1="${dig}{1,3}"
grp="${dig}{3}"

t01="[+-]?${af1}([ ]${grp})*([,.](${grp}[ ])*${af1})?"
t02="[+-]?${af1}([ ]${grp})*([,.]${af0})?"
t03="[+-]?${af0}[,.](${grp}[ ])*${af1}"

t04="[+-]?${af1}([,]${grp})*([.](${grp}[,])*${af1})?"
t05="[+-]?${af1}([,]${grp})*([.]${af0})?"
t06="[+-]?${af0}[.](${grp}[,])*${af1}"

t07="[+-]?${af1}([.]${grp})*([,](${grp}[.])*${af1})?"
t08="[+-]?${af1}([.]${grp})*([,]${af0})?"
t09="[+-]?${af0}[,](${grp}[.])*${af1}"

t10="[+-]?${dig}*[,.]${dig}+"
t11="[+-]?${dig}+[,.]${dig}*"
t12="[+-]?${dig}+"

s01="${t01}|${t02}|${t03}"
s02="${t04}|${t05}|${t06}"
s03="${t07}|${t08}|${t09}"
s04="${t10}|${t11}|${t12}"

reg="${s01}|${s02}|${s03}|${s04}"

echo "$reg"

sed -E -e "s/${reg}/<&>/g"

You can find all this code (and updates) here.
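As a small illustration of the note on the | operator, here is how the ordering effect shows up in Python's re module, which tries alternatives from left to right. Python is only an assumption here, chosen as a convenient backtracking engine; the toy pattern a|ab stands in for the long expression above.

```python
import re

# Unanchored: the first alternative that matches wins, so order matters.
first_wins = re.match(r"a|ab", "ab").group()
reordered = re.match(r"ab|a", "ab").group()
print(first_wins)   # a
print(reordered)    # ab

# Anchored to the whole string: the engine backtracks into the second
# alternative, so the order no longer changes the result.
whole = re.fullmatch(r"a|ab", "ab").group()
print(whole)        # ab
```

When the expression is used to validate a whole field rather than to find numbers inside a longer line, anchoring it this way sidesteps the ordering problem.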

Upvotes: 2

Toto

Reputation: 91375

Assuming

123.4567
123,4567
123 4567

are not valid, you can use:

^[+-]?(?:(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d\d)?|(?:\d{1,3}(?:\.\d{3})*|\d+)(?:,\d\d)?|(?:\d{1,3}(?: \d{3})*|\d+)(?:[,.]\d\d)?)$

Demo & explanation
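A quick way to check this pattern against the lists in the question is a few lines of Python. This is a sketch assuming Python's re module, since the question does not name an engine; is_valid is a hypothetical helper name and the sample strings are taken from the question.

```python
import re

# Pattern copied from this answer, split over three lines for readability.
PATTERN = re.compile(
    r"^[+-]?(?:(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d\d)?"
    r"|(?:\d{1,3}(?:\.\d{3})*|\d+)(?:,\d\d)?"
    r"|(?:\d{1,3}(?: \d{3})*|\d+)(?:[,.]\d\d)?)$"
)


def is_valid(s: str) -> bool:
    """Hypothetical helper: True when the whole string matches the pattern."""
    return PATTERN.match(s) is not None


should_match = ["1234", "123,45", "1,234,567", "12.345.678",
                "123 456 789", "1 345.67", "123.456.789,34"]
# The last entry is rejected only because of the assumption stated above.
should_fail = ["12.345.67", "12,345,67", "12 345 67",
               "123 456 789 34", "123.4567"]

for s in should_match + should_fail:
    print(f"{s!r:20} {is_valid(s)}")
```

Each of the three alternatives fixes one group separator and pairs it with the decimal indicators it can coexist with, which is why mixed inputs such as 12.345.67 are rejected.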

Upvotes: 1

sahinakkaya

Reputation: 6056

There you go:

^[+]?(?:\d{1,3}(?:(,|\.| )\d{3})*|\d+)?((?<!,\d{3})(,\d+)|(?<!\.\d{3})(\.\d+))?$

Regex 101 demo

Upvotes: 1

Pablo Prieto

Reputation: 41

The following regex will match all the cases from your example:

^[+]?(?:\d{1,3}(?:([,. ])\d{3})*|\d+)?(?:[,.]\d+?){0,1}$

The last part, (?:[,.]\d+?){0,1}, makes the decimal part optional.
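It may be worth running the pattern over the "Not Valid" list as well. The sketch below assumes Python's re module (the question does not name an engine); matches is a hypothetical helper and the sample strings come from the question.

```python
import re

# Pattern copied from this answer.
PATTERN = re.compile(r"^[+]?(?:\d{1,3}(?:([,. ])\d{3})*|\d+)?(?:[,.]\d+?){0,1}$")


def matches(s: str) -> bool:
    """Hypothetical helper: True when the whole string matches the pattern."""
    return PATTERN.match(s) is not None


valid = ["1234", "123,45", "1,234,567", "123 456 789",
         "123.4567", "1.345,67", "123 456 789.34"]
invalid = ["12.345.67", "12,345,67", "12 345 67", "123 456 789 34"]

for s in valid + invalid:
    print(f"{s!r:20} {matches(s)}")
```

With Python's engine, every string in valid matches, but so do 12.345.67 and 12,345,67 from the invalid list, because the decimal part may use the same character as the group separator. The empty string also matches, since every part of the pattern is optional.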

Upvotes: 1
