Reputation: 317

grouping and summarizing the rows in a big text file using awk

I have a big text file like this example:

example:

chr11   314980  314981  63  IFITM1  -131
chr11   315025  315026  54  IFITM1  -86
chr5    315085  315086  118 AHRR    -53011
chr16   316087  316088  56  ITFG3   -86
chr16   316088  316089  90  ITFG3   -131
chr11   319672  319673  213 IFITM3  -131
chr11   319674  319675  514 IFITM3  -164

I want to group the rows based on the 6th column and sum the values from the 4th column for every group. the new file would have 2 columns. 1st column would be the group and the 2nd column would be sum (sum of values from column 4 from similar groups). the expected output would look like this:

expected output:

-131    366
-86     110
-53011  118
-164    514

I am trying to do that in awk using the following code.

sort myfile.txt | awk -F'\t' '{ sub(/..$/,"**",$6) }1' OFS='\t' | awk '{print $1 "\t" $2}' > outfile.txt

but actually it returns an empty file. do you know how to fix it?

Upvotes: 0

Answers (3)

oguz ismail

Reputation: 50760

If you're ok with sorted output, you don't need arrays:

sort -k6n file |
awk -F'\t' '
    grp != $6 {
        grp = $6
        printf "%s%s%s%s", sum, sep, grp, FS
        sum = 0
        sep = ORS
    } { sum += $4 } END { print sum }'

Upvotes: 1

Juan Diego Godoy Robles

Reputation: 14955

Use an associative array:

awk '{a[$NF]+=$4}END{for (i in a){print i, a[i]}}' file

Upvotes: 4

glenn jackman

Reputation: 246807

I have no idea what you are thinking with your code: why you are replacing the last 2 chars on the line with asterisks? why aren't you doing any addition anywhere? why do you sort (by column 1) first?

awk -F'\t' '
    {sum[$6] += $4} 
    END {for (key in sum) {print key, sum[key]}}
' file | column -t

Upvotes: 4

grouping and summarizing the rows in a big text file using awk

Answers (3)

Related Questions