Lawrence Noronha

Reputation: 49

How do I detect multiple duplicate fields in a file using Perl?

I have a bunch of orders for Netflix (NFLX) in my brokerage account. I inadvertently entered duplicate GTC Sell orders on 1/5 and 1/6. How do I detect them using a Perl script?

 Buy NFLX     50 @  315.00  Reg-Acct Fake
 Buy NFLX     50 @  317.50  Reg-Acct OPEN              01/13/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15
Sell NFLX     50 @  362.00  Reg-Acct OPEN              11/25/14
...
Sell NFLX     50 @  345.00  IRA-Acct OPEN              09/15/14

I want the script to print just these two lines, which count as duplicates because fields[0] through fields[6] are identical:

Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15

I would prefer a simple script (i.e. no one-liner, no hash) as I am new to Perl.

Thanks, Larry

Upvotes: 0

Views: 102

Answers (2)

7stud

Reputation: 48599

"I would prefer a simple script (no hash)"

Ugh. Missed the no hash. Unfortunately, simple and no hash are opposing goals--not to mention that without a hash, every new line has to be compared against every order seen so far, which gets slow as the file grows. See the hash-based code further down for how you should do it. In the meantime, you'll need parallel arrays:

use strict;
use warnings;
use 5.016;
use Data::Dumper;

my @orders;
my @counts;

my $fname = 'data3.txt';

open my $ORDERSFILE, '<', $fname
    or die "Couldn't open $fname: $!";

LINE:
while (my $line = <$ORDERSFILE>) {
    my @pieces = split ' ', $line;
    my $date = pop @pieces;
    my $order = join ' ', @pieces;

    if (not @orders) { #then length of @orders is 0
        $orders[0] = $order;
        $counts[0] = 1;
        next LINE;
    }

    for my $i (0..$#orders) {
        if ($orders[$i] eq $order) {
            $counts[$i]++;
            next LINE;
        }
    }
    #If execution reaches here, then the order wasn't found in the array...
    my $i = $#counts + 1;
    $orders[$i] = $order;
    $counts[$i] = 1;
}

say Dumper(\@orders);
say Dumper(\@counts);


for my $i (0..$#counts) {
    if ($counts[$i] > 1) {
        say "($counts[$i]) $orders[$i]";
    }
}

--output:--
$VAR1 = [
          'Buy NFLX 50 @ 315.00 Reg-Acct',
          'Buy NFLX 50 @ 317.50 Reg-Acct OPEN',
          'Sell NFLX 50 @ 345.00 Reg-Acct OPEN',
          'Sell NFLX 50 @ 362.00 Reg-Acct OPEN',
          'Sell NFLX 50 @ 345.00 IRA-Acct OPEN'
        ];

$VAR1 = [
          1,
          1,
          2,
          1,
          1
        ];

(2) Sell NFLX 50 @ 345.00 Reg-Acct OPEN
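
The question asked for the duplicate lines themselves, dates included. Here is a minimal sketch of one way to get that, still without a hash. It assumes the data is small enough to hold in an array, and the hard-coded @lines list and the @keys/@dups names are only for illustration:

```perl
use strict;
use warnings;

# Sample data copied from the question; a real script would read a file.
my @lines = (
    ' Buy NFLX     50 @  317.50  Reg-Acct OPEN              01/13/15',
    'Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15',
    'Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15',
    'Sell NFLX     50 @  362.00  Reg-Acct OPEN              11/25/14',
    'Sell NFLX     50 @  345.00  IRA-Acct OPEN              09/15/14',
);

# Parallel array: $keys[$i] is line $i with its last field (the date) removed.
my @keys;
for my $line (@lines) {
    my @pieces = split ' ', $line;
    pop @pieces;
    push @keys, join ' ', @pieces;
}

# Compare every key against every other key; keep lines that have a twin.
my @dups;
for my $i (0 .. $#lines) {
    for my $j (0 .. $#lines) {
        next if $i == $j;
        if ($keys[$i] eq $keys[$j]) {
            push @dups, $lines[$i];
            last;
        }
    }
}

print "$_\n" for @dups;
```

With the sample data this prints the two Reg-Acct Sell orders at 345.00, in file order. The nested loop compares every pair of lines, which is the price of avoiding a hash.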

Here are some better solutions:

use strict;
use warnings;
use 5.016;
use Data::Dumper;

my %dates_for;   #A key will be an order; a value will be a reference to an array of dates.

while (my $line = <DATA>) {
    my @pieces = split ' ', $line;
    my $date = pop @pieces;
    my $order = join ' ', @pieces;

    push @{$dates_for{$order}}, $date;  #autovivification (see explanation below)
}

say Dumper(\%dates_for);

for my $order (keys %dates_for) {
    my @dates = @{$dates_for{$order}};   #copy of this order's dates
    my $dup_count = @dates;

    if ($dup_count > 1) {
        say "($dup_count) $order";
        say "   $_" for @dates;
    }
}


__DATA__
 Buy NFLX     50 @  315.00  Reg-Acct Fake
 Buy NFLX     50 @  317.50  Reg-Acct OPEN              01/13/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/05/15
Sell NFLX     50 @  345.00  Reg-Acct OPEN              01/06/15
Sell NFLX     50 @  362.00  Reg-Acct OPEN              11/25/14
Sell NFLX     50 @  345.00  IRA-Acct OPEN              09/15/14  


--output:--
$VAR1 = {
          'Sell NFLX 50 @ 345.00 IRA-Acct OPEN' => [
                                                     '09/15/14'
                                                   ],
          'Sell NFLX 50 @ 345.00 Reg-Acct OPEN' => [
                                                     '01/05/15',
                                                     '01/06/15'
                                                   ],
          'Buy NFLX 50 @ 317.50 Reg-Acct OPEN' => [
                                                    '01/13/15'
                                                  ],
          'Buy NFLX 50 @ 315.00 Reg-Acct' => [
                                               'Fake'
                                             ],
          'Sell NFLX 50 @ 362.00 Reg-Acct OPEN' => [
                                                     '11/25/14'
                                                   ]
        };

(2) Sell NFLX 50 @ 345.00 Reg-Acct OPEN
   01/05/15
   01/06/15

From the documentation of the autovivification module on CPAN:

"When an undefined variable is dereferenced, it gets silently upgraded to an array or hash reference (depending of the type of the dereferencing). This behaviour is called autovivification and usually does what you mean (e.g. when you store a value)...."

http://search.cpan.org/~vpit/autovivification-0.14/lib/autovivification.pm
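
To see autovivification in isolation, here is a small sketch. The keys 'order A' and 'order B' are made up for the demo. It puts the implicit form next to roughly what you would have to write by hand:

```perl
use strict;
use warnings;

my %dates_for;    # starts out empty

# No key exists yet, so $dates_for{'order A'} is undef. Dereferencing it
# as an array inside push makes Perl create an anonymous array on the fly.
push @{ $dates_for{'order A'} }, '01/05/15';
push @{ $dates_for{'order A'} }, '01/06/15';

# Without relying on autovivification, you would create the array yourself:
$dates_for{'order B'} = [] unless exists $dates_for{'order B'};
push @{ $dates_for{'order B'} }, '09/15/14';

for my $order (sort keys %dates_for) {
    my $count = @{ $dates_for{$order} };    # array in scalar context = length
    print "$order: $count date(s)\n";
}
```

Both keys end up holding an array reference. The first one just gets there without the explicit check.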

For fixed-width columns, it's more efficient to use unpack() than split:

use strict;
use warnings;
use 5.016;
use Data::Dumper;

my $fname = 'data3.txt';

open my $ORDERSFILE, '<', $fname
    or die "Couldn't open $fname: $!";

my %dates_for;

while (my $line = <$ORDERSFILE>) {
    my ($order, $date) = unpack 'A41 @55 A*', $line;   #see explanation below
    push @{$dates_for{$order}}, $date;
}

close $ORDERSFILE;

say Dumper(\%dates_for);

for my $order (keys %dates_for) {
    my @dates = @{$dates_for{$order}};   #copy of this order's dates

    if (@dates > 1) {
        my $dup_count = @dates;
        say "($dup_count) $order";
        say "   $_" for @dates;
    }
}

--output:--
$VAR1 = {
          ' Buy NFLX     50 @  317.50  Reg-Acct OPEN' => [
                                                           '01/13/15'
                                                         ],
          'Sell NFLX     50 @  362.00  Reg-Acct OPEN' => [
                                                           '11/25/14'
                                                         ],
          'Sell NFLX     50 @  345.00  Reg-Acct OPEN' => [
                                                           '01/05/15',
                                                           '01/06/15'
                                                         ],
          ' Buy NFLX     50 @  315.00  Reg-Acct Fake' => [
                                                           ''
                                                         ],
          'Sell NFLX     50 @  345.00  IRA-Acct OPEN' => [
                                                           '09/15/14'
                                                         ]
        };

(2) Sell NFLX     50 @  345.00  Reg-Acct OPEN
   01/05/15
   01/06/15

A41 @55 A* means:
    A41 => extract 41 characters (A),
    @55 => skip to position 55 (@55),
    A*  => extract the remaining characters (A*)

You can skip to any position you want, forwards and backwards, which means you can extract pieces in any order you want.
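
As a rough illustration of going backwards, the sketch below pulls the date out first and the order second. The sample line is built with sprintf so the column widths are guaranteed to match the layout assumed above (order text padded to 55 columns, then an 8-character date). The variable names are only for the demo:

```perl
use strict;
use warnings;

my $order_in = 'Sell NFLX     50 @  345.00  Reg-Acct OPEN';
my $date_in  = '01/05/15';

# Pad the order text to 55 columns so the date starts at offset 55.
my $line = sprintf '%-55s%s', $order_in, $date_in;

# Jump forward to offset 55 for the date, then back to offset 0 for the order.
# A strips trailing spaces, so the padding after the order text is dropped.
my ($date, $order) = unpack '@55 A8 @0 A41', $line;

print "$date | $order\n";
```

The fields come back in the order the template lists them, not the order they appear in the line.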

Upvotes: 0

ysth

Reputation: 98388

I know you said no one-liner, but in case you just meant no Perl one-liners:

sort filename|rev|uniq -D -f 1|rev

Upvotes: 1
