Reputation: 2219
I have a huge text file in the following format:
aaa bbb 1
aaa ccc 2
aaa ddd 3
bbb ww 1
bbb kio 3
I want to aggregate it and the result should be:
aaa bbb 1/6
aaa ccc 2/6
aaa ddd 3/6
bbb ww 1/4
bbb kio 3/4
The third column is the probability p(y|x): each count divided by the sum of the counts that share the same first column.
How can I do this using awk or sed?
Upvotes: 2
Views: 279
Reputation: 58473
This might work for you:
awk 'function p(){for(x=0;x<c;x++)printf("%s/%d\n",l[x],t);k=$1;t=c=0};BEGIN{k=$1};$1!=k{p()};{l[c++]=$0;t+=$3};END{p()}' file
aaa bbb 1/6
aaa ccc 2/6
aaa ddd 3/6
bbb ww 1/4
bbb kio 3/4
N.B. This assumes the file is already sorted (or at least grouped) by the first column.
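If the input is not already grouped by key, one option is to sort it on the first column before streaming it through awk. The following is only a sketch under some assumptions: the sample data and the variable names are made up for illustration, the function is written out in multi-line form with a hypothetical name (emit), and sort -s (stable sort) is assumed to be available, which holds for GNU and BSD sort but is not guaranteed everywhere.

```shell
# Hypothetical sample input, deliberately NOT grouped by the first column.
sample=$(mktemp)
printf '%s\n' 'bbb ww 1' 'aaa bbb 1' 'aaa ccc 2' 'bbb kio 3' 'aaa ddd 3' > "$sample"

# sort -s -k1,1 groups lines by the first field only and keeps the
# original relative order of the lines inside each group.
sorted_result=$(sort -s -k1,1 "$sample" | awk '
    function emit(   i) {                 # print the buffered group, then reset
        for (i = 0; i < n; i++) printf("%s/%d\n", buf[i], total)
        n = 0; total = 0
    }
    NR > 1 && $1 != key { emit() }        # key changed: previous group is complete
    { key = $1; buf[n++] = $0; total += $3 }
    END { emit() }                        # flush the last group
')
printf '%s\n' "$sorted_result"
rm -f "$sample"
```

Only one group is buffered at a time, so memory use depends on the largest group rather than on the size of the whole file.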
Upvotes: 0
Reputation: 226486
You could do it in two passes. First, generate a.tmp by running this awk script over the input and redirecting its output to a.tmp:
{ total[$1] += $3}
END {for (group in total) {print group, total[group]}}
That creates a temporary file with the group totals (the order of the lines may vary, because the order of a for-in loop in awk is unspecified):
bbb 4
aaa 6
Then make a second pass with:
BEGIN {
    while ((getline line < "a.tmp") > 0) {
        split(line, fields, " ")
        group[fields[1]] = fields[2]
    }
    close("a.tmp")
}
{ printf("%s/%d\n", $0, group[$1]) }
That produces the output you're looking for:
aaa bbb 1/6
aaa ccc 2/6
aaa ddd 3/6
bbb ww 1/4
bbb kio 3/4
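For completeness, here is one way the two passes might be run end to end. This is a sketch, not the only way to wire it up: the temporary directory, the input file name infile, and the awk variable totals (passed with -v instead of hard-coding "a.tmp") are assumptions made for the example.

```shell
# Hypothetical working directory and sample input.
workdir=$(mktemp -d)
printf '%s\n' 'aaa bbb 1' 'aaa ccc 2' 'aaa ddd 3' 'bbb ww 1' 'bbb kio 3' > "$workdir/infile"

# Pass 1: sum the third column per key and store the totals in a.tmp.
awk '{ total[$1] += $3 }
     END { for (group in total) print group, total[group] }' \
    "$workdir/infile" > "$workdir/a.tmp"

# Pass 2: load the totals in BEGIN, then append "/total" to every input line.
two_pass_result=$(awk -v totals="$workdir/a.tmp" '
    BEGIN {
        while ((getline line < totals) > 0) {
            split(line, fields, " ")
            group[fields[1]] = fields[2]
        }
        close(totals)
    }
    { printf("%s/%d\n", $0, group[$1]) }
' "$workdir/infile")
printf '%s\n' "$two_pass_result"
rm -r "$workdir"
```

The input does not need to be sorted here, since the totals for every key are known before the second pass starts.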
Upvotes: 0
Reputation: 140417
awk 'NR==FNR{a[$1]+=$3;next}{printf("%s/%d\n",$0,a[$1])}' ./infile ./infile
aaa bbb 1/6
aaa ccc 2/6
aaa ddd 3/6
bbb ww 1/4
bbb kio 3/4
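The same command, spread over several lines with comments, may be easier to follow. This is just an expanded sketch of the one-liner; the sample file and the names data, sum and single_cmd_result are made up for the example.

```shell
# Hypothetical sample input.
data=$(mktemp)
printf '%s\n' 'aaa bbb 1' 'aaa ccc 2' 'aaa ddd 3' 'bbb ww 1' 'bbb kio 3' > "$data"

single_cmd_result=$(awk '
    NR == FNR {                 # true only while the first copy of the file is read
        sum[$1] += $3           # accumulate the total for this key
        next                    # skip the printing rule during the first read
    }
    { printf("%s/%d\n", $0, sum[$1]) }   # second read: append "/total" to the line
' "$data" "$data")
printf '%s\n' "$single_cmd_result"
rm -f "$data"
```

Because the file name is given twice, awk reads the file two times, so the input has to be a regular file rather than a pipe. Only the per-key totals are kept in memory, not the lines themselves.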
Upvotes: 6