Reputation: 788
For the following input data,
Chr C rsid D A1 A2 ID1_AA ID1_AB ID1_BB ID2_AA ID2_AB ID2_BB ID3_AA ID3_AB ID3_BB ID4_AA ID4_AB ID4_BB ID5_AA ID5_AB ID5_BB
10 p rsid1 q A G 0.00 0.85 0.15 0.70 0.10 0.20 0.40 0.50 0.10 0.30 0.30 0.40 0.10 0.20 0.80
10 p rsid2 q C T 0.90 0.10 0.00 0.80 0.10 0.10 0.70 0.10 0.20 0.30 0.40 0.30 0.30 0.20 0.40
10 p rsid3 q A G 0.40 0.50 0.10 0.80 0.20 0.00 0.20 0.30 0.50 0.50 0.30 0.20 0.20 0.30 0.40
I need to generate the following output data.
rsid ID1 ID2 ID3 ID4 ID5
rsid1 2.15 1.50 1.70 2.10 2.90
rsid2 1.10 1.30 1.50 2.00 1.90
rsid3 1.70 1.20 2.30 1.70 2.00
The table shows, for every ID (ID1, ID2, ID3, etc.), the sum of the three genotype columns (_AA, _AB & _BB), each weighted by a constant factor (1, 2, 3).
Example: for rsid1 --> ID1 -> (ID1_AA*1 + ID1_AB*2 + ID1_BB*3) = (0.00*1 + 0.85*2 + 0.15*3) = 2.15
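As a quick check of the arithmetic (just an illustration, not part of my script), the same formula can be typed straight into an awk BEGIN block with the values from the table above:
awk 'BEGIN { printf "%.2f\n", 0.00*1 + 0.85*2 + 0.15*3 }'    # prints 2.15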
I wrote the following AWK script to accomplish the task, and it works absolutely fine.
Please note: I'm a complete beginner in AWK.
awk '{
    if (NR <= 1) {                          # header line
        str = $3;
        for (i = 7; i <= NF; i += 3) {
            split($i, s, "_");              # keep only the ID part of e.g. ID1_AA
            str = str "\t" s[1]
        }
        print str
    } else {                                # data line
        k = 0;
        for (i = 7; i <= NF; i += 3)
            arr[k++] = $i*1 + $(i+1)*2 + $(i+2)*3;
        str = $3;
        for (i = 0; i < (NF-6)/3; i++)
            str = str "\t" arr[i];
        print str
    }
}' input.txt > out.txt
Later I was told the input data can be as big as 60 million rows and 300 thousand columns, which means the output data will be roughly 60M x 100K. If I'm not wrong, AWK reads one line at a time, so at any instant there will be 300K columns of data held in memory. Is that a problem? Given the situation, how can I improve my code?
Upvotes: 1
Views: 334
Reputation: 8164
Do you think making use of a low-level language like C would help?
C++ or C is not automagically faster than awk; also, the code is less readable and more fragile.
I show another solution using C++, for comparison:
// p.cpp
#include <stdio.h>

// modify this value to match the number of IDs in the input
#define COLUMNS 5

int main() {
    char column3[256];
    bool header = true;
    // read the first 6 fields of every line, keeping only the 3rd (the rsid)
    while (scanf("%*s\t%*s\t%255s\t%*s\t%*s\t%*s\t", column3) == 1) {
        printf("%s", column3);
        if (header) {
            header = false;
            char name[256];
            for (int i = 0; i < COLUMNS; ++i) {
                // keep the part before '_' of the _AA header, skip the _AB and _BB headers
                scanf("%255[^_]_%*s\t%*s\t%*s\t", name);
                printf("\t%s", name);
            }
        } else {
            float nums[3];
            for (int i = 0; i < COLUMNS; ++i) {
                scanf("%f %f %f", nums, nums + 1, nums + 2);
                float sum = nums[0] + nums[1]*2 + nums[2]*3;
                printf("\t%2.2f", sum);
            }
        }
        printf("\n");
    }
}
Compile and run it like:
g++ p.cpp -o p
cat file | ./p
Benchmark
with 1 million lines of input and 300 columns
Ed Morton's solution: 2m 34s
C++: 1m 19s
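For anyone who wants to reproduce the timing, a test file of that shape can be generated with awk itself. This is only a sketch (not the exact file used above): the values are random, and ids, rows and the output file name are placeholders you would adjust, e.g. ids=100 gives 300 genotype columns.
awk 'BEGIN {
    srand()
    ids  = 5                       # number of IDs -> 3*ids genotype columns
    rows = 1000000                 # number of data lines
    printf "Chr C rsid D A1 A2"
    for (j = 1; j <= ids; j++)
        printf " ID%d_AA ID%d_AB ID%d_BB", j, j, j
    print ""
    for (n = 1; n <= rows; n++) {
        printf "10 p rsid%d q A G", n
        for (j = 1; j <= 3 * ids; j++)
            printf " %.2f", rand()
        print ""
    }
}' > test_input.txt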
Upvotes: 0
Reputation: 203532
While both approaches have pros/cons and both can handle any number of rows/columns since they only store 1 row at a time in memory, I'd use this approach rather than the answer posted by Akshay. Since you have 300,000 columns per line, his approach tests NR==1 inside the per-column loop, i.e. almost 100,000 times per line, whereas the approach below performs that test just once per line, so it should be noticeably more efficient:
$ cat tst.awk
BEGIN { OFS="\t" }
{
    printf "%s", $3
    if (NR==1) {
        gsub(/_[^[:space:]]+/,"")
        for (i=7; i<=NF; i+=3) {
            printf "%s%s", OFS, $i
        }
    }
    else {
        for (i=7; i<=NF; i+=3) {
            printf "%s%.2f", OFS, $i + $(i+1)*2 + $(i+2)*3
        }
    }
    print ""
}
$ awk -f tst.awk file
rsid ID1 ID2 ID3 ID4 ID5
rsid1 2.15 1.50 1.70 2.10 2.90
rsid2 1.10 1.30 1.50 2.00 1.90
rsid3 1.70 1.20 2.30 1.70 2.00
I highly recommend you read the book Effective Awk Programming, 4th Edition, by Arnold Robbins to learn what awk is and how to use it.
Upvotes: 4
Reputation: 16997
awk -v OFS="\t" '
{
    printf("%s", $3);
    for (i=7; i<=NF; i+=3)
    {
        if (FNR==1)
        {
            sub(/_.*/,"",$i)
            f = $i
        }
        else
        {
            f = sprintf("%5.2f", $i*1 + $(i+1)*2 + $(i+2)*3)
        }
        printf("%s%s", OFS, f)
    }
    print ""
}
' file
Output
rsid ID1 ID2 ID3 ID4 ID5
rsid1 2.15 1.50 1.70 2.10 2.90
rsid2 1.10 1.30 1.50 2.00 1.90
rsid3 1.70 1.20 2.30 1.70 2.00
Upvotes: 0