Reputation: 11
I'm quite new to awk.
I am trying to write a script that takes an input file, finds the sum of the third column, and then prints columns 1 and 2 followed by the normalized third column. However, when I run it, I only seem to get output for the last row of my input file. I think I am missing something about how 'END' works. Any tips?
Thanks!
BEGIN {
col= ARGV[2]
ARGV[2] = ""
}
{s1 += $3}
END { if (NR > 0){
print s1;
print $1, $2, $3/s1
}
}
INPUT:
0 2 8.98002e-05
1 0 5.66203e-05
2 2 2.20586e-05
3 2 5.31672e-05
4 2 2.17192e-07
5 26 3.67908e-06
6 1 1.0385e-05
7 1 7.78022e-05
8 0 5.47272e-05
9 1 6.34726e-05
10 1 0.000105879
11 1 4.77847e-05
12 0 3.05258e-05
13 0 5.53268e-05
14 1 7.8916e-05
15 1 3.02601e-05
16 1 3.81807e-05
s1: 0.000818803
OUTPUT:
0.000818803
0 2 0.109673
0.000818803
1 0 0.0691501
0.000818803
2 2 0.0269401
0.000818803
3 2 0.0649328
0.000818803
4 2 0.000265256
0.000818803
5 26 0.00449324
0.000818803
6 1 0.0126831
0.000818803
7 1 0.0950194
0.000818803
8 0 0.0668381
0.000818803
9 1 0.0775188
0.000818803
10 1 0.129309
0.000818803
11 1 0.0583592
0.000818803
12 0 0.037281
0.000818803
13 0 0.0675703
0.000818803
14 1 0.0963797
0.000818803
15 1 0.0369565
0.000818803
16 1 0.0466299
Upvotes: 1
Views: 402
Reputation: 10875
For this, one way or another, you'll have to make two passes through the records. One way is to read the file itself twice as in the first method shown below.
The first pass simply accumulates the total of column 3 in s1. The second pass prints the first two columns with the normalized third.
Note that you have to provide the file twice on the command line so that awk processes it twice!
$ awk 'NR == FNR {s1 += $3; next} {print $1, $2, $3/s1}' file file
0 2 0.109673
1 0 0.0691501
2 2 0.0269401
3 2 0.0649329
4 2 0.000265256
5 26 0.00449324
6 1 0.0126832
7 1 0.0950195
8 0 0.0668381
9 1 0.0775188
10 1 0.12931
11 1 0.0583592
12 0 0.037281
13 0 0.0675704
14 1 0.0963798
15 1 0.0369565
16 1 0.0466299
Another way, which is closer to where you were headed with your attempt, is to only read the file once, keeping all the row information in memory while you simultaneously sum column 3.
Then, in the END block, which runs after all records are read and the sum is fully accumulated, you iterate through the arrays to print out the results.
awk ' { s1 += $3; a[NR] = $1 OFS $2; b[NR] = $3 }
END { for (i=1; i<=NR; ++i) print a[i], b[i] / s1 }' file
This second method has the obvious downside of using much more memory, since every row is held until the end. With a very large file this approach may not even be feasible.
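As for why the original attempt printed only one row: the END block runs exactly once, after the last record has been read, and at that point $1, $2 and $3 still hold the fields of that last record (this is the behavior in gawk and mawk at least; some very old awks may differ). A minimal sketch with made-up three-line input, where the function name is just for illustration:

```shell
# Hypothetical input: three rows, second column sums to 6.
last_row_demo() {
  printf '%s\n' 'a 1' 'b 2' 'c 3' |
  awk '
    { sum += $2 }               # main rule: runs once per record
    END { print $1, $2 / sum }  # END: runs once, fields are from the last record
  '
}
last_row_demo
```

With gawk or mawk this prints a single line, c 0.5, which is the same symptom as in the question: the sum is right, but only the final row is available to normalize.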
If you're not already familiar with the NR == FNR construct, see What is "NR==FNR" in awk?. Also see the section on "Two-file processing" at https://backreference.org/2010/02/10/idiomatic-awk/.
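To see what the NR == FNR test is doing, it can help to print both counters while awk reads the same file twice. NR counts records across all input, while FNR restarts at 1 for each file, so the two are equal only during the first file. A small sketch, assuming a throwaway two-line temp file (the function name and file contents are made up):

```shell
nr_fnr_demo() {
  tmp=$(mktemp)
  printf '%s\n' 'x' 'y' > "$tmp"
  # Same file given twice: NR keeps counting, FNR resets for the second pass.
  awk '{ print NR, FNR, (NR == FNR ? "first pass" : "second pass") }' "$tmp" "$tmp"
  rm -f "$tmp"
}
nr_fnr_demo
```

This should print 1 1 first pass, 2 2 first pass, 3 1 second pass, 4 2 second pass, one per line. One caveat: if the file is empty there is no first pass to speak of, and with two different files an empty first file makes NR == FNR true for the second one as well.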
Upvotes: 1