Reputation: 107
I have an example csv file like below (but with way more columns numbering up to Sample 100 and several rows)
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,44
Unclassified,0,0,392
Unclassified,0,0,0
Woeseia,0,0,76
and I would like to have a summed csv file as below where all the identical entries on column 1 are summed up
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
I tried the following awk code but it didn't work
awk -F "," 'function SP() {n=split ($0, T); ID=$1}
function PR() {printf "%s", ID; for (i=2; i<=n; i++) printf "\t%s", T[i]; printf "\n"}
NR==1 {SP();next}
$1 != ID {PR(); SP(); next}
{for (i=2; i<=NF; i++) T[i]+=$i}
END {PR()}
' Filename.csv
I am also aware of doing something like below but it is impractical when there are hundreds of columns. Any help here would be appreciated.
awk -F "," ' NR==1 {print; next} NF {a[$1]+=$2; b[$1]+=$3; c[$1]+=$4; d[$1]+=$5; e[$1]+=$6; f[$1]++} END {for(i in a)print i, a[i], b[i], c[i], d[i], e[i], f[i]} ' Filename.csv
Upvotes: 0
Views: 51
Reputation: 17055
Here's an other awk
:
awk -v FS=',' -v OFS=',' '
NR == 1 {
print
next
}
{
ids[$1]
for (i = 2; i <= NF; i++)
sums[i "," $1] += $i
}
END {
for (id in ids) {
out = id
for (i = 2; i <= NF; i++)
out = out OFS sums[i "," id]
print out
}
}
' Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
You can also use a CSV-aware program that provides tools for data analysis.
Here's an example with Miller, which is available as a stand-alone executable:
IFS='' read -r csv_header < Filename.csv
mlr --csv \
stats1 -a sum -g "${csv_header%%,*}" -f "${csv_header#*,}" \
then rename -r '(.*)_sum,\1' \
Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
Upvotes: 1
Reputation: 133538
With your shown samples, please try following awk
program. You need NOT to create these many arrays, you could easily do it with 1 or 2 here.
awk '
BEGIN { FS=OFS="," }
FNR==1{
print
next
}
{
for(i=2;i<=NF;i++){
arr1[$1]
arr2[$1,i]+=$i
}
}
END{
for(i in arr1){
printf("%s,",i)
for(j=2;j<=NF;j++){
printf("%s%s",arr2[i,j],j==NF?ORS:OFS)
}
}
}
' Input_file
Output will be as follows:
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN { FS=OFS="," } ##In BEGIN section setting FS and OFS as comma here.
FNR==1{ ##Checking if this is first line then do following.
print ##Printing current line.
next ##next will skip further statements from here.
}
{
for(i=2;i<=NF;i++){ ##Running for loop from 2nd field to till NF here.
arr1[$1] ##Creating arr1 array with index of 1st field.
arr2[$1,i]+=$i ##Creating arr2 with index of 1st field and current field number and value is current field value which is keep adding into it.
}
}
END{ ##Starting END block for this program from here.
for(i in arr1){ ##Traversing through arr1 all elements here one by one.
printf("%s,",i) ##Printing its current index here.
for(j=2;j<=NF;j++){ ##Running for loop from 2nd field to till NF here.
printf("%s%s",arr2[i,j],j==NF?ORS:OFS) ##Printing value of arr2 with index of i and j, printing new line if its last field.
}
}
}
' Input_file ##Mentioning Input_file here.
Upvotes: 2