Reputation: 788

Optimize AWK script for large dataset

For the following input data,

Chr C   rsid    D   A1  A2  ID1_AA  ID1_AB  ID1_BB  ID2_AA  ID2_AB  ID2_BB  ID3_AA  ID3_AB  ID3_BB  ID4_AA  ID4_AB  ID4_BB  ID5_AA  ID5_AB  ID5_BB
10  p   rsid1   q   A   G   0.00    0.85    0.15    0.70    0.10    0.20    0.40    0.50    0.10    0.30    0.30    0.40    0.10    0.20    0.80
10  p   rsid2   q   C   T   0.90    0.10    0.00    0.80    0.10    0.10    0.70    0.10    0.20    0.30    0.40    0.30    0.30    0.20    0.40
10  p   rsid3   q   A   G   0.40    0.50    0.10    0.80    0.20    0.00    0.20    0.30    0.50    0.50    0.30    0.20    0.20    0.30    0.40

I need to generate the following output data.

rsid        ID1         ID2         ID3         ID4         ID5
rsid1      2.15        1.50        1.70        2.10        2.90
rsid2      1.10        1.30        1.50        2.00        1.90
rsid3      1.70        1.20        2.30        1.70        2.00

The table show the sum of 3 columns (_AA, _AB & _BB) by multiplying with a constant factor (1, 2, 3) for every ID (ID1, ID2, ID3, etc).

Example: for rsID1 --> ID1 -> (ID1_AA*1 + ID1_AB*2 + ID1_BB*3) = (0.00*1 + 0.85*2 + 0.15*3) = 2.15

I wrote the following AWK script to establish the task and it works absolutely fine.

Please note: I'm a very beginner in AWK.

awk '{
    if(NR <= 1) { # header line
        str = $3; 
        for(i=7; i<=NF; i+=3) {
            split($i,s,"_”);
            str = str"\t"s[1]
        }
        print str
    }  else { # data line
        k = 0; 
        for(i=7; i<=NF; i+=3) 
            arr[k++] = $i*1 + $(i+1)*2 + $(i+2)*3; 
        str=$3; 
        for(i=0; i<=(NF-6)/3; i++) 
            str = str"\t"arr[i]; 
        print str
    }
}'  input.txt > out.txt

Later I was told the input data can be as big as 60 Million rows & 300 Thousand columns which means the output data will be 60Mx100K. If I'm not wrong, AWK reads one line at a time & hence at an instant there will be 300K columns of data held in memory. Is it a problem? Given the situation, how can I improve my code?

Upvotes: 1

Answers (3)

Jose Ricardo Bustos M.

Reputation: 8164

Do u think making use of a low level language like C?

C++ or C is not automagically faster than awk, also, the code is less readable and more fragile.

I show another solution using c++, to compare

//p.cpp
#include <stdio.h>

//to modify this value
#define COLUMNS 5

int main() {
    char column3[256];
    bool header=true;
    while (scanf("%*s\t%*s\t%255s\t%*s\t%*s\t%*s\t", column3) == 1) {
        printf("%s", column3);
        if(header){
            header=false;
            char name[256];
            for(int i=0; i<COLUMNS; ++i){
                scanf("%[^_]_%*s\t%*s\t%*s\t", name);
                printf("\t%s", name);
            }
        }else{
            float nums[3];
            for(int i=0; i<COLUMNS; ++i){
                scanf("%f %f %f", nums, nums + 1, nums + 2);
                float sum = nums[0]+nums[1]*2+nums[2]*3;
                printf("\t%2.2f", sum);
            }
        }
        printf("\n");
    }
}

Run it, like

g++ p.cpp -o p
cat file | ./p

Benchmark

with 1 millon of lines in input and 300 columns

Ed Morton solution: 2m 34s
c++: 1m 19s

Upvotes: 0

Ed Morton

Reputation: 203532

While both approaches have pros/cons and they can both handle any number of rows/columns since they only store 1 row at a time in memory, I'd use this approach rather than the answer posted by Akshay since you have 300,000 columns per line so his approach would require you to test NR==1 almost 100,000 times per line whereas the approach below will just perform the test 1 time per line so it should be noticeably more efficient:

$ cat tst.awk
BEGIN { OFS="\t" }
{
    printf "%s", $3
    if (NR==1) {
        gsub(/_[^[:space:]]+/,"")
        for (i=7; i<=NF; i+=3) {
            printf "%s%s", OFS, $i
        }
    }
    else {
        for (i=7; i<=NF; i+=3) {
            printf "%s%.2f", OFS, $i + $(i+1)*2 + $(i+2)*3
        }
    }
    print ""
}

$ awk -f tst.awk file
rsid    ID1     ID2     ID3     ID4     ID5
rsid1   2.15    1.50    1.70    2.10    2.90
rsid2   1.10    1.30    1.50    2.00    1.90
rsid3   1.70    1.20    2.30    1.70    2.00

I highly recommend you read the book Effective Awk Programming, 4th Edition, by Arnold Robbins to learn what awk is and how to use it.

Upvotes: 4

Akshay Hegde

Reputation: 16997

awk -v OFS="\t" '
            {
              printf("%s",$3);
              for(i=7;i<=NF; i+=3)
              {
                if(FNR==1)
                {
                   sub(/_.*/,"",$i)
                   f = $i
                }else
                {
                    f = sprintf("%5.2f",$i*1 + $(i+1)*2 + $(i+2)*3)
                }
                   printf("%s%s",OFS,f)
              }
                print ""
            }
    ' file

Output

rsid     ID1     ID2     ID3     ID4     ID5
rsid1    2.15    1.50    1.70    2.10    2.90
rsid2    1.10    1.30    1.50    2.00    1.90
rsid3    1.70    1.20    2.30    1.70    2.00

Upvotes: 0

Optimize AWK script for large dataset

Answers (3)

Related Questions