user2380782

Reputation: 1446

awk replacement using specific values in two columns

I have a file that looks like this:

20:60479_C_T 60479 C T  0 0 0 0 0 1 0 1
20:60522_T_TC 60522 T TC        0 0 0 0 0 0 0 
20:60568_A_C 60568 A C  0 0 1 0 0 1 
20:60571_C_A 60571 C A  0 1 0 1 0 0 
20:60579_G_A 60579 G A  0 0 1 0 0 0 

My actual file is much bigger, with 3 million rows and 3,000 columns. I want to use the values in columns $3 and $4 to replace the 0s and 1s in the remaining columns: 0 becomes the value in $3 and 1 becomes the value in $4. The desired output would be:

20:60479_C_T 60479 C T  C C C C C T C T
20:60522_T_TC 60522 T TC        T T T T T T T 
20:60568_A_C 60568 A C  A A C A A C 
20:60571_C_A 60571 C A  C A C A C C 
20:60579_G_A 60579 G A  G G A G G G 

I know how to do it for a couple of columns:

awk '{d["0"]=$3; d["1"]=$4; print "20", $1, "0", $2, d[$5], d[$6];}' myfile

But I don't know how to do it automatically for all the columns without listing each one manually.

Upvotes: 1

Views: 56

Answers (3)

paxdiablo

Reputation: 881103

Since you have a variable number of columns, you can probably get away with something like:

awk <testprog.in '{for (i = 5; i <= NF; i++){$i = $($i+3)}print}'

The "magic" here is assigning $($i+3) to $i for every value of i between 5 and the field count (inclusive).

The expression $i+3 turns 0 and 1 into 3 and 4 respectively, so the outer $ then evaluates $3 or $4 (the C and T in the first line, for example) and uses that value to replace the item.

The output for your small test case is, as expected:

20:60479_C_T 60479 C T C C C C C T C T
20:60522_T_TC 60522 T TC T T T T T T T
20:60568_A_C 60568 A C A A C A A C
20:60571_C_A 60571 C A C A C A C C
20:60579_G_A 60579 G A G G A G G G
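
To see the indexing step in isolation, here is a small sketch that prints, for each data field of one sample line from the question, the field's value, the field number it computes, and the allele that number resolves to. The output layout and the variable names (line, out, idx) are just for illustration:

```shell
# Show how $($i+3) resolves for each data field of one sample line.
line='20:60568_A_C 60568 A C  0 0 1 0 0 1'
out=$(printf '%s\n' "$line" | awk '{
    for (i = 5; i <= NF; i++) {
        idx = $i + 3    # 0 becomes 3, 1 becomes 4
        printf "field %d: %s -> $%d = %s\n", i, $i, idx, $idx
    }
}')
printf '%s\n' "$out"
```

For this line, every field holding 0 resolves to $3 (A) and every field holding 1 resolves to $4 (C).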

You will, of course, need to check the performance of this with your larger data sets. On my box, a three-million-line file with 3000 entries each takes about half an hour.

Compare that with a C program (although admittedly quick'n'dirty, heavily tied to your specific input data, without what I would generally consider necessary error checking) which only takes about ten minutes.

For completeness, here's the C variant which, assuming it's called prog.c, you can compile with something like gcc -o prog prog.c and run with something like ./prog <testprog.in:

#include <stdio.h>
#include <ctype.h>

static char buff[102040];

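/* Return a pointer to the next whitespace-delimited token at or after buff,
   storing its length in *pSz; return NULL when no token remains. */
static char *getStr(char *buff, int *pSz) {
    if (*buff == 0) return NULL;

    /* Skip leading whitespace (the cast keeps isspace() well-defined for any char). */
    char *nextBuff = buff;
    while ((nextBuff[0] != 0) && isspace((unsigned char)nextBuff[0])) {
        nextBuff++;
    }
    if (*nextBuff == 0) return NULL;

    /* Measure the token up to the next whitespace or the end of the string. */
    *pSz = 0;
    while ((nextBuff[*pSz] != 0) && ! isspace((unsigned char)nextBuff[*pSz])) {
        (*pSz)++;
    }

    return nextBuff;
}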

int main(void) {
    char *str, *str3, *str4; int sz, sz3, sz4;

    while (fgets(buff, sizeof(buff), stdin) != NULL) {
        str = getStr(buff, &sz); printf("%*.*s", sz, sz, str);
        str = getStr(str + sz, &sz); printf(" %*.*s", sz, sz, str);
        str3 = getStr(str + sz, &sz3); printf(" %*.*s", sz3, sz3, str3); 
        str4 = getStr(str3 + sz3, &sz4); printf(" %*.*s", sz4, sz4, str4);

        str = getStr(str4 + sz4, &sz);
        while (str != NULL) {
            if (*str == '0') {
                printf(" %*.*s", sz3, sz3, str3);
            } else {
                printf(" %*.*s", sz4, sz4, str4);
            }
            str = getStr(str + sz, &sz);
        }
        printf("\n");
    }
    return 0;
}
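
If you want a rough timing on your own machine before committing to a full run, one option is to build a small synthetic file and time the awk one-liner on it. This is only a sketch: the sizes (rows, cols), the fixed C/T alleles and the temporary file names are arbitrary assumptions, and date +%s only has one-second resolution, so increase rows until the elapsed time is measurable.

```shell
# Build a small synthetic input; rows/cols are arbitrary sample sizes.
rows=1000
cols=200
sample=$(mktemp)
awk -v rows="$rows" -v cols="$cols" 'BEGIN {
    srand(1)
    for (r = 1; r <= rows; r++) {
        printf "20:%d_C_T %d C T", r, r
        for (c = 1; c <= cols; c++) printf " %d", (rand() < 0.5)
        printf "\n"
    }
}' > "$sample"

# Time the replacement (whole seconds only).
start=$(date +%s)
awk '{for (i = 5; i <= NF; i++) $i = $($i + 3)} 1' "$sample" > "$sample.out"
end=$(date +%s)
echo "elapsed: $((end - start)) s for $rows rows x $cols data columns"

# Sanity check: count data fields that are still 0 or 1 after the replacement.
leftover=$(awk '{for (i = 5; i <= NF; i++) if ($i == 0 || $i == 1) n++} END {print n + 0}' "$sample.out")
echo "unreplaced fields: $leftover"
```

Scaling the measured time by the ratio of real to sample size gives only a ballpark figure, since disk speed and the awk implementation in use both matter.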

Upvotes: 0

Ed Morton

Reputation: 203169

$ awk '{d[0]=$3; d[1]=$4; for (i=5; i<=NF; i++) $i=d[$i]} 1' file
20:60479_C_T 60479 C T C C C C C T C T
20:60522_T_TC 60522 T TC T T T T T T T
20:60568_A_C 60568 A C A A C A A C
20:60571_C_A 60571 C A C A C A C C
20:60579_G_A 60579 G A G G A G G G

Upvotes: 1

Joaquin

Reputation: 2091

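As another option, you could use gsub in awk. Because gsub acts on the whole record, $1 and $2 (which contain 0s and 1s of their own) are saved first and restored afterwards: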

$ awk '{d[1]=$1;d[2]=$2;gsub(/0/,$3);gsub(/1/,$4);$1=d[1];$2=d[2];}1' myfile
20:60479_C_T 60479 C T C C C C C T C T
20:60522_T_TC 60522 T TC T T T T T T T
20:60568_A_C 60568 A C A A C A A C
20:60571_C_A 60571 C A C A C A C C
20:60579_G_A 60579 G A G G A G G G

Upvotes: 0
