knorv
knorv

Reputation: 50127

Regexp/perl code for handling both dots and commas as valid decimal separators

I'm trying to create a method that provides "best effort" parsing of decimal inputs in cases where I do not know which of these two mutually exclusive ways of writing numbers the end-user is using:

The method is implemented as parse_decimal(..) in the code below. Furthermore, I've defined 20 test cases that show how the heuristics of the method should work.

While the code below passes the tests it is quite horrible and unreadable. I'm sure there is a more compact and readable way to implement the method. Possibly including smarter use of regexpes.

My question is simply: Given the code below and the test-cases, how would you improve parse_decimal(...) to make it more compact and readable while still passing the tests?

Clarifications:

The code in question including the test cases:

#!/usr/bin/perl -wT

use strict;
use warnings;
use Test::More tests => 20;

ok(&parse_decimal("1,234,567") == 1234567);
ok(&parse_decimal("1,234567") == 1.234567);
ok(&parse_decimal("1.234.567") == 1234567);
ok(&parse_decimal("1.234567") == 1.234567);
ok(&parse_decimal("12,345") == 12345);
ok(&parse_decimal("12,345,678") == 12345678);
ok(&parse_decimal("12,345.67") == 12345.67);
ok(&parse_decimal("12,34567") == 12.34567);
ok(&parse_decimal("12.34") == 12.34);
ok(&parse_decimal("12.345") == 12345);
ok(&parse_decimal("12.345,67") == 12345.67);
ok(&parse_decimal("12.345.678") == 12345678);
ok(&parse_decimal("12.34567") == 12.34567);
ok(&parse_decimal("123,4567") == 123.4567);
ok(&parse_decimal("123.4567") == 123.4567);
ok(&parse_decimal("1234,567") == 1234.567);
ok(&parse_decimal("1234.567") == 1234.567);
ok(&parse_decimal("12345") == 12345);
ok(&parse_decimal("12345,67") == 12345.67);
ok(&parse_decimal("1234567") == 1234567);

sub parse_decimal($) {
    my $input = shift;
    $input =~ s/[^\d,\.]//g;
    if ($input !~ /[,\.]/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d+\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d+,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d+\.\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d+,\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d{4},\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d{4}\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d{3}$/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d{3}$/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } else {
        return &parse_with_separators($input, '.', ',');
    }
}

sub parse_with_separators($$$) {
    my $input = shift;
    my $decimal_separator = shift;
    my $thousand_separator = shift;
    my $output = $input;
    $output =~ s/\Q${thousand_separator}\E//g;
    $output =~ s/\Q${decimal_separator}\E/./g;
    return $output;
}

Upvotes: 0

Views: 2581

Answers (5)

Leonardo Herrera
Leonardo Herrera

Reputation: 8406

Trying to guess the locale of anything is always an ongoing effort, at best. What are you using this function for? The following tests look simply wrong to me:

ok(&parse_decimal("12.34") == 12.34);
ok(&parse_decimal("12.345") == 12345);

If I were parsing a single document with values on it I would be very irritated to find this result.

I would design this function with some knobs and dials in the package to use either the locale information (using localeconv()) or ad-hoc values (like in this answer.)

Edit: Okay, let me try to explain it better. For "single source" I mean a context or scope delimiter. I know you can import from different sources; that's the very nature of importing data. We also know that we cannot know beforehand the encoding of these different sources.

What I would do is to is to do a preliminary scan of the file being imported (just taking a sample, not reading it whole) and check the numeric values. If I can determine the locale from the sample, then I would try to import the whole file using the same locale. To me, one file is a single source, and I wouldn't expect it to suddenly change its locale.

That's why I would ask again: what is the purpose of this program?

Upvotes: 0

brian d foy
brian d foy

Reputation: 132886

The idea in these problems is to look at the code and figure out where you typed anything twice. When you see that, work to remove it. My program handles everything in your test data, and I don't have to repeat program logic structures to do it. That lets me focus on the data rather than program flow.

First, let's clean up your tests. You really have a set of pairs that you want to test, so let's put them into a data structure. You can add or remove items from the data structure as you like, and the tests will automatically adjust:

use Test::More 'no_plan';

my @pairs = (
     #  got          expect
    [ "1,234,567",  1234567  ],
    [ "1,234567",   1.234567 ],
    [ "1.234.567",  1234567  ],
    [ "1.234567",   1.234567 ],
    [ "12,345",     12345    ],
    [ "12,345,678", 12345678 ],
    [ "12,345.67",  12345.67 ],
    [ "12,34567",   12.34567 ],
    [ "12.34",      12.34    ],
    [ "12.345",     12345    ],  # odd case!
    [ "12.345,67",  12345.67 ],
    [ "12.345.678", 12345678 ],
    [ "12.34567",   12.34567 ],
    [ "123,4567",   123.4567 ],
    [ "123.4567",   123.4567 ],
    [ "1234,567",   1234.567 ],
    [ "1234.567",   1234.567 ],
    [ "12345",      12345    ],
    [ "12345,67",   12345.67 ],
    [ "1234567",    1234567  ],
);

Now that you have it in a data structure, your long line of tests reduces to a short foreach loop:

foreach my $pair ( @pairs ) {
     my( $original, $expected ) = @$pair;
     my $got = parse_number( $original );
     is( $got, $expected, "$original translates to $expected" );
     }

The parse_number routine likewise condenses into this simple code. Your trick is to find out what you are doing over and over again in the source and not do that. Instead of trying to figure out weird calling conventions and long chains of conditionals, I normalize the data. I figure out which cases are odd, then turn them into not-odd cases. In this code, I condense all of the knowledge about the separators into a handful of regexes and return one of two possible lists to show me what the thousands separator and decimal separator are. Once I have that, I remove the thousands separator completely and make the decimal separator the full stop. As I find more cases, I merely add a regex that returns true for that case:

sub parse_number
    {
    my $string = shift;

    my( $separator, $decimal ) = do {
        local $_ = $string;
        if( 
            /\.\d\d\d\./           || # two dots
            /\.\d\d\d,/            || # dot before comma
            /,\d{4,}/              || # comma with many following digits
            /\d{4,},/              || # comma with many leading digits
            /^\d{1,3}\.\d\d\d\z/   || # odd case of 123.456
            0
            )
            { qw( . , ) }
        else { qw( , . ) }      
        };

    $string =~ s/\Q$separator//g;
    $string =~ s/\Q$decimal/./;

    $string;
    }

This is the sort of thing I talk about in the dynamic subroutines chapter of Mastering Perl. Although I won't go into it here, I would probably turn that series of regexes into a pipeline of some sort and use a grep.

This is just the part of the program that passes your tests. I'd add another step to verify that the number is an expected format to deal with dirty data, but that's not so hard and is just a simple matter of programming.

Upvotes: 3

Jeff B
Jeff B

Reputation: 30099

To answer your question, not questioning the validity of the application,

Your algorithm boils down to the following pseudo-code:

if (num_separators == 1) {
    if (trailing_digits != 3 || leading_digits > 3) {
         # Replace comma with period
         return output
    }
}

# Replace first separator with nothing, and second with '.'

return output

Which is the following in perl:

my $output = $input;

my $num_separators = (scalar split /[\,\.]/, $input) - 1;

if($num_separators == 1) {
    my ($leading_digits, $trailing_digits) = split /[\,\.]/, $input;

    if(length($trailing_digits) != 3 || length($leading_digits) > 3) {
        $output =~ s/\,/\./;
        return eval($output);
    }
}

if($output =~ /^\d+\.\d+/) {
    # Swap commas and periods if periods are first
    $output =~ tr/\.\,/\,\./;
}

# remove commas
$output =~ s/\,//g;

return eval($output);

Don't know if that is any better, but it is generalized.

Upvotes: 0

Eric
Eric

Reputation: 6435

Here is a somewhat shorter and more complete version of the function:

sub parse_decimal($) {
    my $input = shift;
    my %other = ("." => ",", "," => ".");
    $input =~ s/[^\d,.]//g;
    if ($input !~ /[,.]/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /(\.).*(,)/ or $input =~ /(,).*(\.)/) { # Both separators present
        return &parse_with_separators($input, $2, $1);
    } elsif ($input =~ /([,.])$/ or $input =~ /^([,.])/) { # Number ends or begins with decimal separator
        return &parse_with_separators($input, $1, $other{$1});
    } elsif ($input =~ /\d{4}([,.])/ or $input =~ /([,.])\d{4}/) { # group of 4+ digits next to a separator
        return &parse_with_separators($input, $1, $other{$1});
    } elsif ($input =~ /([,.]).*\1/) { # More than one of the same separator
        return &parse_with_separators($input, $other{$1}, $1);
    } elsif ($input =~ /\d*([,.])\d{0,2}$/) { # Fewer than 2 digits to the right of the separator
        return &parse_with_separators($input, $1, $other{$1});
    } else { # Assume '.' is decimal separator and ',' is thousands separator
        return &parse_with_separators($input, '.', ',');
    }
}

Some important things:

  • Your "12.345" test case is also ambiguous. Anything of the form \d{1,3}[,.]\d\d\d is ambiguous.
  • This function handles numbers of the form ".123" "123."
  • This function makes terrible, terrible assumptions about the input being well-formed.

Upvotes: 1

jrockway
jrockway

Reputation: 42684

This is like programs that automatically guess the character encoding of the input -- it might work sometimes, but in general is a very bad strategy that leads to buggy and confusing non-deterministic behavior.

For example, if you see "123,456", you don't have enough information to guess what that means.

So I would approach this with caution, and never use this technique for anything important.

Upvotes: 6

Related Questions