Reputation: 50127
I'm trying to create a method that provides "best effort" parsing of decimal inputs in cases where I do not know which of these two mutually exclusive ways of writing numbers the end-user is using:
The method is implemented as parse_decimal(..)
in the code below. Furthermore, I've defined 20 test cases that show how the heuristics of the method should work.
While the code below passes the tests it is quite horrible and unreadable. I'm sure there is a more compact and readable way to implement the method. Possibly including smarter use of regexpes.
My question is simply: Given the code below and the test-cases, how would you improve parse_decimal(...) to make it more compact and readable while still passing the tests?
Clarifications:
^\d{1,3}[\.,]\d{3}$
is ambiguous in that one cannot determine logically which character is used as thousands separator and which is used as a decimal separator. In ambiguous cases we'll simply assume that US-style decimals are used: "," as thousands separator and "." as decimal separator.The code in question including the test cases:
#!/usr/bin/perl -wT
use strict;
use warnings;
use Test::More tests => 20;
ok(&parse_decimal("1,234,567") == 1234567);
ok(&parse_decimal("1,234567") == 1.234567);
ok(&parse_decimal("1.234.567") == 1234567);
ok(&parse_decimal("1.234567") == 1.234567);
ok(&parse_decimal("12,345") == 12345);
ok(&parse_decimal("12,345,678") == 12345678);
ok(&parse_decimal("12,345.67") == 12345.67);
ok(&parse_decimal("12,34567") == 12.34567);
ok(&parse_decimal("12.34") == 12.34);
ok(&parse_decimal("12.345") == 12345);
ok(&parse_decimal("12.345,67") == 12345.67);
ok(&parse_decimal("12.345.678") == 12345678);
ok(&parse_decimal("12.34567") == 12.34567);
ok(&parse_decimal("123,4567") == 123.4567);
ok(&parse_decimal("123.4567") == 123.4567);
ok(&parse_decimal("1234,567") == 1234.567);
ok(&parse_decimal("1234.567") == 1234.567);
ok(&parse_decimal("12345") == 12345);
ok(&parse_decimal("12345,67") == 12345.67);
ok(&parse_decimal("1234567") == 1234567);
sub parse_decimal($) {
my $input = shift;
$input =~ s/[^\d,\.]//g;
if ($input !~ /[,\.]/) {
return &parse_with_separators($input, '.', ',');
} elsif ($input =~ /\d,\d+\.\d/) {
return &parse_with_separators($input, '.', ',');
} elsif ($input =~ /\d\.\d+,\d/) {
return &parse_with_separators($input, ',', '.');
} elsif ($input =~ /\d\.\d+\.\d/) {
return &parse_with_separators($input, ',', '.');
} elsif ($input =~ /\d,\d+,\d/) {
return &parse_with_separators($input, '.', ',');
} elsif ($input =~ /\d{4},\d/) {
return &parse_with_separators($input, ',', '.');
} elsif ($input =~ /\d{4}\.\d/) {
return &parse_with_separators($input, '.', ',');
} elsif ($input =~ /\d,\d{3}$/) {
return &parse_with_separators($input, '.', ',');
} elsif ($input =~ /\d\.\d{3}$/) {
return &parse_with_separators($input, ',', '.');
} elsif ($input =~ /\d,\d/) {
return &parse_with_separators($input, ',', '.');
} elsif ($input =~ /\d\.\d/) {
return &parse_with_separators($input, '.', ',');
} else {
return &parse_with_separators($input, '.', ',');
}
}
sub parse_with_separators($$$) {
my $input = shift;
my $decimal_separator = shift;
my $thousand_separator = shift;
my $output = $input;
$output =~ s/\Q${thousand_separator}\E//g;
$output =~ s/\Q${decimal_separator}\E/./g;
return $output;
}
Upvotes: 0
Views: 2581
Reputation: 8406
Trying to guess the locale of anything is always an ongoing effort, at best. What are you using this function for? The following tests look simply wrong to me:
ok(&parse_decimal("12.34") == 12.34);
ok(&parse_decimal("12.345") == 12345);
If I were parsing a single document with values on it I would be very irritated to find this result.
I would design this function with some knobs and dials in the package to use either the locale information (using localeconv()) or ad-hoc values (like in this answer.)
Edit: Okay, let me try to explain it better. For "single source" I mean a context or scope delimiter. I know you can import from different sources; that's the very nature of importing data. We also know that we cannot know beforehand the encoding of these different sources.
What I would do is to is to do a preliminary scan of the file being imported (just taking a sample, not reading it whole) and check the numeric values. If I can determine the locale from the sample, then I would try to import the whole file using the same locale. To me, one file is a single source, and I wouldn't expect it to suddenly change its locale.
That's why I would ask again: what is the purpose of this program?
Upvotes: 0
Reputation: 132886
The idea in these problems is to look at the code and figure out where you typed anything twice. When you see that, work to remove it. My program handles everything in your test data, and I don't have to repeat program logic structures to do it. That lets me focus on the data rather than program flow.
First, let's clean up your tests. You really have a set of pairs that you want to test, so let's put them into a data structure. You can add or remove items from the data structure as you like, and the tests will automatically adjust:
use Test::More 'no_plan';
my @pairs = (
# got expect
[ "1,234,567", 1234567 ],
[ "1,234567", 1.234567 ],
[ "1.234.567", 1234567 ],
[ "1.234567", 1.234567 ],
[ "12,345", 12345 ],
[ "12,345,678", 12345678 ],
[ "12,345.67", 12345.67 ],
[ "12,34567", 12.34567 ],
[ "12.34", 12.34 ],
[ "12.345", 12345 ], # odd case!
[ "12.345,67", 12345.67 ],
[ "12.345.678", 12345678 ],
[ "12.34567", 12.34567 ],
[ "123,4567", 123.4567 ],
[ "123.4567", 123.4567 ],
[ "1234,567", 1234.567 ],
[ "1234.567", 1234.567 ],
[ "12345", 12345 ],
[ "12345,67", 12345.67 ],
[ "1234567", 1234567 ],
);
Now that you have it in a data structure, your long line of tests reduces to a short foreach
loop:
foreach my $pair ( @pairs ) {
my( $original, $expected ) = @$pair;
my $got = parse_number( $original );
is( $got, $expected, "$original translates to $expected" );
}
The parse_number
routine likewise condenses into this simple code. Your trick is to find out what you are doing over and over again in the source and not do that. Instead of trying to figure out weird calling conventions and long chains of conditionals, I normalize the data. I figure out which cases are odd, then turn them into not-odd cases. In this code, I condense all of the knowledge about the separators into a handful of regexes and return one of two possible lists to show me what the thousands separator and decimal separator are. Once I have that, I remove the thousands separator completely and make the decimal separator the full stop. As I find more cases, I merely add a regex that returns true for that case:
sub parse_number
{
my $string = shift;
my( $separator, $decimal ) = do {
local $_ = $string;
if(
/\.\d\d\d\./ || # two dots
/\.\d\d\d,/ || # dot before comma
/,\d{4,}/ || # comma with many following digits
/\d{4,},/ || # comma with many leading digits
/^\d{1,3}\.\d\d\d\z/ || # odd case of 123.456
0
)
{ qw( . , ) }
else { qw( , . ) }
};
$string =~ s/\Q$separator//g;
$string =~ s/\Q$decimal/./;
$string;
}
This is the sort of thing I talk about in the dynamic subroutines chapter of Mastering Perl. Although I won't go into it here, I would probably turn that series of regexes into a pipeline of some sort and use a grep.
This is just the part of the program that passes your tests. I'd add another step to verify that the number is an expected format to deal with dirty data, but that's not so hard and is just a simple matter of programming.
Upvotes: 3
Reputation: 30099
To answer your question, not questioning the validity of the application,
Your algorithm boils down to the following pseudo-code:
if (num_separators == 1) {
if (trailing_digits != 3 || leading_digits > 3) {
# Replace comma with period
return output
}
}
# Replace first separator with nothing, and second with '.'
return output
Which is the following in perl:
my $output = $input;
my $num_separators = (scalar split /[\,\.]/, $input) - 1;
if($num_separators == 1) {
my ($leading_digits, $trailing_digits) = split /[\,\.]/, $input;
if(length($trailing_digits) != 3 || length($leading_digits) > 3) {
$output =~ s/\,/\./;
return eval($output);
}
}
if($output =~ /^\d+\.\d+/) {
# Swap commas and periods if periods are first
$output =~ tr/\.\,/\,\./;
}
# remove commas
$output =~ s/\,//g;
return eval($output);
Don't know if that is any better, but it is generalized.
Upvotes: 0
Reputation: 6435
Here is a somewhat shorter and more complete version of the function:
sub parse_decimal($) {
my $input = shift;
my %other = ("." => ",", "," => ".");
$input =~ s/[^\d,.]//g;
if ($input !~ /[,.]/) {
return &parse_with_separators($input, '.', ',');
} elsif ($input =~ /(\.).*(,)/ or $input =~ /(,).*(\.)/) { # Both separators present
return &parse_with_separators($input, $2, $1);
} elsif ($input =~ /([,.])$/ or $input =~ /^([,.])/) { # Number ends or begins with decimal separator
return &parse_with_separators($input, $1, $other{$1});
} elsif ($input =~ /\d{4}([,.])/ or $input =~ /([,.])\d{4}/) { # group of 4+ digits next to a separator
return &parse_with_separators($input, $1, $other{$1});
} elsif ($input =~ /([,.]).*\1/) { # More than one of the same separator
return &parse_with_separators($input, $other{$1}, $1);
} elsif ($input =~ /\d*([,.])\d{0,2}$/) { # Fewer than 2 digits to the right of the separator
return &parse_with_separators($input, $1, $other{$1});
} else { # Assume '.' is decimal separator and ',' is thousands separator
return &parse_with_separators($input, '.', ',');
}
}
Some important things:
Upvotes: 1
Reputation: 42684
This is like programs that automatically guess the character encoding of the input -- it might work sometimes, but in general is a very bad strategy that leads to buggy and confusing non-deterministic behavior.
For example, if you see "123,456", you don't have enough information to guess what that means.
So I would approach this with caution, and never use this technique for anything important.
Upvotes: 6