Reputation: 11
I have a few thousand files with data as such:
bash$ cat somefile0001.csv
col1;col2;col3; ..... ;col10
2.34;0.19;6.40; ..... ;4.20
3.8;2.45;2.20; ..... ;5.09E+003
Basically, it's a 10x301 field .csv file with a header row at the top, delimited by semicolons (I didn't include the whole thing for brevity).
My goal is to change the scientific notation to decimal numbers, average each column, output each column header with its column average to a new CSV file, and then do this for thousands of files.
I already have working code to loop through all the files; I just can't get the averaging part to work.
#!/bin/bash
filename=csvfile.csv
i=1
runningsum=0
echo "" > $filename.tmp.$i
tmptrnfrm=$(cut -f$i -d ';' $filename)
tmpfilehold=$filename.tmp.$i
echo "$tmptrnfrm" >> $tmpfilehold
trnsfrmcount=0
for j in $(cat $tmpfilehold)
do
    if [[ $trnsfrmcount = 0 ]]
    then
        echo -n "Iteration $trnsfrmcount:"
        echo "$j" #>> $tmpfilehold
        trnsfrmcount=$[$trnsfrmcount+1]
    elif [[ $trnsfrmcount < 301 ]]
    then
        if [[ $(echo $j | sed 's/[0-9].[0-9][0-9]E+[0-9]/arbitrarystring/') == arbitrarystring ]]
        then
            tempj=$(printf "%0f" $j)
            runningsum=$(echo '$runningsum + $tempj' | bc)
            echo "$j" #>> tmpfilehold
            trnsfrmcount=$[$trnsfrmcount+1]
        else
            echo "preruns: $runningsum"
            runningsum=$(echo '$runningsum + $j' | bc)
            echo "$j," #>> $tmpfilehold
            echo "the running sum is: $runningsum"
            trnsfrmcount=$[$trnsfrmcount+1]
        fi
    fi
done
totalz=$(echo '$runningsum / 300' | bc)
echo "here is the total"
echo "$totalz"
exit 0
I know it's kind of messy; I put a lot of extra output to stdout to see what was happening while it ran. I would like to do this in Perl, but I'm just learning it and I know this can be done with bash. Also, I don't have access to the CSV module and have no way to install it (otherwise it might be really easy).
Any help is greatly appreciated.
Upvotes: 1
Views: 967
Reputation: 866
Tabulator is a set of unix command line tools for working with delimited files that have header lines. Here is an example that computes the average of the first two columns:
tblred -d';' -su -c'avg1_col=avg(col1),avg_col2=avg(col2)' somefile00001.csv
produces
avg1_col;avg_col2
3.07;1.32
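If tblred works for you, a simple shell loop can apply it to all of the files. A rough sketch; the somefile*.csv glob and the .avg.csv output naming are only examples, so adjust them to your actual file names:
for f in somefile*.csv; do
    # hypothetical batch run: one averaged output file per input file
    tblred -d';' -su -c'avg1_col=avg(col1),avg_col2=avg(col2)' "$f" > "${f%.csv}.avg.csv"
done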
Upvotes: 0
Reputation: 1318
Here's a basic perl script that should do what you want. I haven't tested it.
#!/usr/bin/perl
use strict;
use warnings;

my $infile  = shift;
my $outfile = shift || $infile . ".new";

my $header = "";
my $count  = 0;
my @sums   = ();
my @means  = ();

open my $fin, '<', $infile or die $!;

$header = <$fin>;
@sums = map { 0 } split ";", $header;    # to initialize @sums

while ( my $line = <$fin> ) {
    chomp $line;
    my @fields = split ";", $line;
    for ( my $i = 0 ; $i < scalar @fields ; $i++ ) {
        # use sprintf to convert to decimal notation
        # if we think we are using scientific notation
        if ( $fields[$i] =~ m/E/i ) {
            $sums[$i] += sprintf( "%.2f", $fields[$i] );
        } else {
            $sums[$i] += $fields[$i];
        }
    }
    $count++;
}
close $fin;

exit 1 if $count == 0;

# calculate averages
@means = map { sprintf( "%.2f", $_ / $count ) } @sums;

# intentionally left out writing to a file
print $header;
print join( ";", @means ) . "\n";
Upvotes: 1