Rodriguez J Mathew
Rodriguez J Mathew

Reputation: 107

How to sum up all other columns based on column 1?

I have an example csv file like below (but with way more columns numbering up to Sample 100 and several rows)

Genus,Sample1,Sample2,Sample3
Unclassified,0,1,44
Unclassified,0,0,392
Unclassified,0,0,0
Woeseia,0,0,76

and I would like to have a summed csv file as below where all the identical entries on column 1 are summed up

Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76

I tried the following awk code but it didn't work

awk  -F "," 'function SP()  {n=split ($0, T); ID=$1}
         function PR()  {printf "%s", ID; for (i=2; i<=n; i++) printf "\t%s", T[i]; printf "\n"}

         NR==1          {SP();next}
         $1 != ID       {PR(); SP(); next}
                        {for (i=2; i<=NF; i++) T[i]+=$i}
         END            {PR()}
        ' Filename.csv

I am also aware of doing something like below but it is impractical when there are hundreds of columns. Any help here would be appreciated.

awk -F "," ' NR==1 {print; next} NF {a[$1]+=$2; b[$1]+=$3; c[$1]+=$4; d[$1]+=$5; e[$1]+=$6; f[$1]++} END {for(i in a)print i, a[i], b[i], c[i], d[i], e[i], f[i]} ' Filename.csv 

Upvotes: 0

Views: 51

Answers (2)

Fravadona
Fravadona

Reputation: 17055

Here's an other awk:

awk -v FS=',' -v OFS=',' '
    NR == 1 {
        print
        next
    }
    {
        ids[$1]
        for (i = 2; i <= NF; i++)
            sums[i "," $1] += $i
    }
    END {
        for (id in ids) {
            out = id
            for (i = 2; i <= NF; i++)
                out = out OFS sums[i "," id]
            print out
        }
    }
' Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76

You can also use a CSV-aware program that provides tools for data analysis.
Here's an example with Miller, which is available as a stand-alone executable:

IFS='' read -r csv_header < Filename.csv

mlr --csv \
    stats1 -a sum -g "${csv_header%%,*}" -f "${csv_header#*,}" \
    then rename -r '(.*)_sum,\1' \
    Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76

Upvotes: 1

RavinderSingh13
RavinderSingh13

Reputation: 133538

With your shown samples, please try following awk program. You need NOT to create these many arrays, you could easily do it with 1 or 2 here.

awk '
BEGIN { FS=OFS="," }
FNR==1{
  print
  next
}
{
  for(i=2;i<=NF;i++){
    arr1[$1]
    arr2[$1,i]+=$i
  }
}
END{
  for(i in arr1){
    printf("%s,",i)
    for(j=2;j<=NF;j++){
      printf("%s%s",arr2[i,j],j==NF?ORS:OFS)
    }
  }
}
'  Input_file

Output will be as follows:

Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76

Explanation: Adding detailed explanation for above code.

awk '                         ##Starting awk program from here.
BEGIN { FS=OFS="," }          ##In BEGIN section setting FS and OFS as comma here.
FNR==1{                       ##Checking if this is first line then do following.
  print                       ##Printing current line.
  next                        ##next will skip further statements from here.
}
{
  for(i=2;i<=NF;i++){         ##Running for loop from 2nd field to till NF here.
    arr1[$1]                  ##Creating arr1 array with index of 1st field.
    arr2[$1,i]+=$i            ##Creating arr2 with index of 1st field and current field number and value is current field value which is keep adding into it.
  }
}
END{                          ##Starting END block for this program from here.
  for(i in arr1){             ##Traversing through arr1 all elements here one by one.
    printf("%s,",i)           ##Printing its current index here.
    for(j=2;j<=NF;j++){       ##Running for loop from 2nd field to till NF here.
      printf("%s%s",arr2[i,j],j==NF?ORS:OFS) ##Printing value of arr2 with index of i and j, printing new line if its last field.
    }
  }
}
'  Input_file                 ##Mentioning Input_file here.

Upvotes: 2

Related Questions