Ivri

Reputation: 2219

Column aggregation in Linux

I have a huge text file in the following format:

aaa bbb 1
aaa ccc 2
aaa ddd 3
bbb ww 1
bbb kio 3

I want to aggregate it and the result should be:

aaa bbb 1/6
aaa ccc 2/6
aaa ddd 3/6
bbb ww 1/4
bbb kio 3/4

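The third column should become the conditional probability p(y|x): each value divided by the sum of the values that share the same first column.

How can I do that using awk or sed?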

Upvotes: 2

Views: 279

Answers (3)

potong

Reputation: 58473

This might work for you:

awk 'function p(){for(x=0;x<c;x++)printf("%s/%d\n",l[x],t);k=$1;t=c=0};BEGIN{k=$1};$1!=k{p()};{l[c++]=$0;t+=$3};END{p()}' file
aaa bbb 1/6
aaa ccc 2/6
aaa ddd 3/6
bbb ww 1/4
bbb kio 3/4

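N.B. This assumes the file is already grouped by key, i.e. all lines with the same first column are adjacent (for example, because the file is pre-sorted by key).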
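
For readability, the same buffering idea can be spelled out over several lines. This is only a sketch of the one-liner above, not a different method: the function name emit_group, the variable names, and the sample input written to "file" are made up for illustration, and the input is still assumed to be grouped by key.

```shell
# Sample input from the question, written to a hypothetical file named "file".
printf '%s\n' 'aaa bbb 1' 'aaa ccc 2' 'aaa ddd 3' 'bbb ww 1' 'bbb kio 3' > file

result=$(awk '
    # Print every buffered line of the finished group with "/total" appended,
    # then start a new group keyed on the current first field.
    function emit_group(   i) {
        for (i = 0; i < count; i++) printf("%s/%d\n", lines[i], total)
        key = $1; total = count = 0
    }
    $1 != key { emit_group() }               # key changed: flush the previous group
    { lines[count++] = $0; total += $3 }     # buffer the line and add to the running total
    END { emit_group() }                     # flush the last group
' file)

printf '%s\n' "$result"
```

Only one group is held in memory at a time, which is why the lines of a group have to be adjacent in the input.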

Upvotes: 0

Raymond Hettinger

Reputation: 226486

You could do it in two passes. Generate a.tmp using:

{ total[$1] += $3}
END {for (group in total) {print group, total[group]}}

That creates a temporary file with the group totals:

bbb 4
aaa 6

Then make a second pass with:

BEGIN {
    while ((getline line < "a.tmp") > 0) {
        split(line, fields, " ")
        group[fields[1]] = fields[2]
    }
    close("a.tmp")
}
{   printf("%s/%d\n", $0, group[$1]) }

That produces the output you're looking for:

aaa bbb 1/6
aaa ccc 2/6
aaa ddd 3/6
bbb ww 1/4
bbb kio 3/4
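
The two scripts above could be wired together from the shell roughly as follows. This is a sketch under assumptions: the input file name infile, the output name result.txt, and the sample data are chosen here for illustration, and both awk programs are simply the ones shown above passed inline.

```shell
# Sample input from the question, written to a hypothetical file named "infile".
printf '%s\n' 'aaa bbb 1' 'aaa ccc 2' 'aaa ddd 3' 'bbb ww 1' 'bbb kio 3' > infile

# Pass 1: sum the third column per first-column key and store the totals in a.tmp.
awk '{ total[$1] += $3 }
     END { for (group in total) print group, total[group] }' infile > a.tmp

# Pass 2: load the totals from a.tmp, then append "/total" to every input line.
awk 'BEGIN {
         while ((getline line < "a.tmp") > 0) {
             split(line, fields, " ")
             group[fields[1]] = fields[2]
         }
         close("a.tmp")
     }
     { printf("%s/%d\n", $0, group[$1]) }' infile > result.txt

cat result.txt
rm -f a.tmp   # the temporary totals file is no longer needed
```

The order of the lines in a.tmp is unspecified because awk does not guarantee a traversal order for "for (group in total)", but that does not matter here since the second pass looks totals up by key.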

Upvotes: 0

SiegeX

Reputation: 140417

awk 'NR==FNR{a[$1]+=$3;next}{printf("%s/%d\n",$0,a[$1])}' ./infile ./infile

Output

$ awk 'NR==FNR{a[$1]+=$3;next}{printf("%s/%d\n",$0,a[$1])}' ./infile ./infile
aaa bbb 1/6
aaa ccc 2/6
aaa ddd 3/6
bbb ww 1/4
bbb kio 3/4
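
An expanded, commented form of the same one-liner may make the NR==FNR idiom easier to follow. It is a sketch of the identical program; the sample infile is created here only so the example is self-contained.

```shell
# Sample input from the question, written to ./infile.
printf '%s\n' 'aaa bbb 1' 'aaa ccc 2' 'aaa ddd 3' 'bbb ww 1' 'bbb kio 3' > infile

result=$(awk '
    # First read of the file: the global record count NR still equals the
    # per-file record count FNR, so only this rule runs.
    NR == FNR { a[$1] += $3; next }    # accumulate the per-key total, skip the rule below

    # Second read: NR keeps counting while FNR restarts at 1, so NR != FNR
    # and every line is printed with the total of its key appended.
    { printf("%s/%d\n", $0, a[$1]) }
' ./infile ./infile)

printf '%s\n' "$result"
```

Because the file is named twice on the command line, this needs a regular file rather than a pipe. In exchange, the input does not have to be sorted or grouped by key.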

Upvotes: 6
