Reputation: 673
My data is like the folowing
1,A-1-LYS-N,0.0704,0.1339,0.1733
9,A-1-LYS-NZ,0.0514,0.0543,0.1181
2,A-1-LYS-CA,0.0418,0.0664,0.125
7,A-1-LYS-CD,0.0406,0.0777,0.119
8,A-1-LYS-CE,0.0405,0.078,0.1206
4,A-1-LYS-O,0.0377,0.037,0.0903
6,A-1-LYS-CG,0.0373,0.0775,0.1472
5,A-1-LYS-CB,0.0233,0.0522,0.1103
3,A-1-LYS-C,0.0136,0.0256,0.0832
13,A-2-VAL-O,0.0471,0.0662,0.0647
12,A-2-VAL-C,0.0453,0.0681,0.1098
10,A-2-VAL-N,0.038,0.055,0.0657
15,A-2-VAL-CG1,0.0274,0.0435,0.0459
11,A-2-VAL-CA,0.0253,0.0485,0.0658
14,A-2-VAL-CB,0.0159,0.0049,0.0064
16,A-2-VAL-CG2,0.0149,0.0333,0.0604
27,A-3-PHE-CZ,0.0644,0.061,0.1214
26,A-3-PHE-CE2,0.0471,0.0449,0.0915
25,A-3-PHE-CE1,0.0458,0.0509,0.0653
22,A-3-PHE-CG,0.0343,0.0515,0.0878
19,A-3-PHE-C,0.0338,0.0772,0.0853
20,A-3-PHE-O,0.0269,0.0566,0.0515
24,A-3-PHE-CD2,0.025,0.0295,0.0785
18,A-3-PHE-CA,0.0244,0.0557,0.0637
21,A-3-PHE-CB,0.0225,0.0366,0.0369
17,A-3-PHE-N,0.0135,0.056,0.0997
23,A-3-PHE-CD1,0.0049,0.0204,0.0101
This goes until A-LEU-129
. I want to calculate the average of column 3, 4 and 5 per group of data. Every group of data is separated by a blank line. How do I do that with awk? Output should be in another file like this
A-1-LYS: 0.039, 0.066, 0.120
A-2-VAL: 0.030, 0.045, 0.059
A-3-PHE: 0.031, 0.049, 0.071
Or preferably
1 0.039 A-1-LYS
2 0.066 A-1-LYS
3 0.120 A-1-LYS
1 0.030 A-2-VAL
2 0.045 A-2-VAL
3 0.059 A-2-VAL
1 0.031 A-3-PHE
2 0.049 A-3-PHE
3 0.071 A-3-PHE
Thank you
I know awk calculates average per column, but I am not sure if this awk can handle this or maybe it is better to use R.
Upvotes: 0
Views: 418
Reputation: 6759
#! /bin/bash
awk -F, '{
if($0 == "") {
--NR
if (group == "" || flag == 1){ next }
printf "%s: %0.3f, %0.3f, %0.3f\n", group, sum1/NR, sum2/NR, sum3/NR
NR=0
sum1=sum2=sum3=0
flag=1
next
}
if (NR == 1) {
flag=sum1=sum2=sum3=0
split($2, arr, "-")
group=arr[1]"-"arr[2]"-"arr[3]
}
sum1+=$3
sum2+=$4
sum3+=$5
} END {if(flag==0){--NR ; printf "%s: %0.3f, %0.3f, %0.3f\n", group, sum1/NR, sum2/NR, sum3/NR}}' file
Other preferred format:
#! /bin/bash
awk -F, '{
if($0 == "") {
--NR
if (group == "" || flag == 1){ next }
print "1", sum1/NR, group
print "2", sum2/NR, group
print "3", sum3/NR, group
NR=0
sum1=sum2=sum3=0
flag=1
next
}
if (NR == 1) {
flag=0
sum1=0;sum2=0;sum3=0
split($2, arr, "-")
group=arr[1]"-"arr[2]"-"arr[3]
}
sum1+=$3
sum2+=$4
sum3+=$5
} END {if(flag==0){NR=NR-1 ; print group, sum1/NR, sum2/NR, sum3/NR}}' file
In case if you want numbers to be displayed only upto 3 decimal places,
#! /bin/bash
awk -F, '{
if($0 == "") {
--NR
if (group == "" || flag == 1){ next }
print "1", substr(sum1/NR, 0, 5), group
print "2", substr(sum2/NR, 0, 5), group
print "3", substr(sum3/NR, 0, 5), group
NR=0
sum1=sum2=sum3=0
flag=1
next
}
if (NR == 1) {
flag=0
sum1=0;sum2=0;sum3=0
split($2, arr, "-")
group=arr[1]"-"arr[2]"-"arr[3]
}
sum1+=$3
sum2+=$4
sum3+=$5
} END {if(flag==0){NR=NR-1 ; print group, sum1/NR, sum2/NR, sum3/NR}}' file
Try this, also if you can assure you have a blank line at the end of your file, you won't need END
block and flag
.
I have no knowledge about R
so no idea if R
would be better.
Upvotes: 3
Reputation: 407
This solution works in R.
# Create dataframe
df <- data.frame(
x1 = c(1, 9, 27, 26),
x2 = c('A-1-LYS-N', 'A-1-LYS-NZ', 'A-LEU-129-CZ', 'A-LEU-129-CE2'),
x3 = c(0.0704, 0.0514, 0.0644, 0.0471),
x4 = c(0.1339, 0.0543, 0.061, 0.0449),
x5 = c(0.1733, 0.1181, 0.1214, 0.0915)
)
# Create groupings
library(stringr);
df$grouping <- word(gsub('-',' ',df$x2),1,3); df$grouping <- gsub(' ','-',df$grouping)
# Calculate means by groupings
library(dplyr)
summarize(group_by(df, grouping), mean(x3), mean(x4), mean(x5), n())
Upvotes: 2