Reputation: 43
Whats the best way to summarize data from a file that has around 2 million records in Perl?
For eg: A file like this,
ABC|XYZ|DEF|EGH|100
ABC|XYZ|DEF|FGH|200
SDF|GHT|WWW|RTY|1000
SDF|GHT|WWW|TYU|2000
Needs to be summarized on the first 3 columns like this,
ABC|XYZ|DEF|300
SDF|GHT|WWW|3000
Chris
Upvotes: 4
Views: 1203
Reputation: 29844
1-2-3-4 I declare A CODE-GOLF WAR!!! (Okay, a reasonably readable code-golf dust-up.)
my %sums;
m/([^|]+\|[^|]+\|[^|]+).*?\|(\d+)/ and $sums{ $1 } += $2 while <>;
print join( "\n", ( map { "$_|$sums{$_}" } sort keys %sums ), '' );
Upvotes: 0
Reputation: 14
Sort to put all records with the same first 3 triplets next to each other. Iterate through and kick out a subtotal when a different set of triplets appears.
$prevKey="";
$subtotal=0;
open(INPUTFILE, "<$inFile");
@lines=<INPUTFILE>;
close (INPUTFILE);
open(OUTFILE, ">$outFile");
@sorted=sort(@lines);
foreach $line(@lines){
@parts=split(/\|/g, $line);
$value=pop(@parts);
$value-=0; #coerce string to a number
$key=$parts[0]."|".$parts[1]."|".$parts[2];
if($key ne $prevKey){
print OUTFILE "$prevKey|$subtotal\n";
$prevKey=$key;
$subtotal=0;
}
$subtotal+=$value;
}
close(OUTFILE);
If sorting 2 million chokes your box then you may have to put each record into a file based on the group and then do the subtotal for each file.
Upvotes: -1
Reputation: 26086
Presuming your input file has its records in separate lines.
perl -n -e 'chomp;@a=split/\|/;$h{join"|",splice@a,0,3}+=pop@a;END{print map{"$_: $h{$_}\n"}keys%h}' < inputfile
Upvotes: 0
Reputation: 46183
Assuming there are always five columns, the fifth of which is numeric, and you always want the first three columns to be the key...
use warnings;
use strict;
my %totals_hash;
while (<>)
{
chomp;
my @cols = split /\|/;
my $key = join '|', @cols[0..2];
$totals_hash{$key} += $cols[4];
}
foreach (sort keys %totals_hash)
{
print $_, '|', $totals_hash{$_}, "\n";
}
Upvotes: 3
Reputation: 454930
You can use a hash as:
my %hash;
while (<DATA>) {
chomp;
my @tmp = split/\|/; # split each line on |
my $value = pop @tmp; # last ele is the value
pop @tmp; # pop unwanted entry
my $key = join '|',@tmp; # join the remaining ele to form key
$hash{$key} += $value; # add value for this key
}
# print hash key-values.
for(sort keys %hash) {
print $_ . '|'.$hash{$_}."\n";
}
Upvotes: 2