Reputation: 1446
I have a file that looks like this:
20:60479_C_T 60479 C T 0 0 0 0 0 1 0 1
20:60522_T_TC 60522 T TC 0 0 0 0 0 0 0
20:60568_A_C 60568 A C 0 0 1 0 0 1
20:60571_C_A 60571 C A 0 1 0 1 0 0
20:60579_G_A 60579 G A 0 0 1 0 0 0
My real file is much bigger, with 3 million rows and 3,000 columns. I want to use the values in columns $3 and $4 to replace 0 and 1 in the rest of the columns. The desired output would be:
20:60479_C_T 60479 C T C C C C C T C T
20:60522_T_TC 60522 T TC T T T T T T T
20:60568_A_C 60568 A C A A C A A C
20:60571_C_A 60571 C A C A C A C C
20:60579_G_A 60579 G A G G A G G G
I know how to do it for a couple of columns:
awk '{d["0"]=$3; d["1"]=$4; print "20", $1, "0", $2, d[$5], d[$6];}' myfile
But I don't know how to do it automatically for all the columns and avoid adding each column manually.
Upvotes: 1
Views: 56
Reputation: 881103
Since you have a variable number of columns, you can probably get away with something like:
awk <testprog.in '{for (i = 5; i <= NF; i++){$i = $($i+3)}print}'
The "magic" here is the assigning of $($i+3) to $i for all values of i between 5 and the field count (inclusive).
Because $ binds more tightly than +, the expression $i+3 means ($i)+3. It will turn 0 and 1 into 3 and 4 respectively, and so the next step will be evaluating $3 or $4 (the C and T in the first line, for example) and using that to replace the item.
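To see the indirection in isolation, here is a minimal sketch that prints the computed field number next to the field it selects. The input line is a shortened, made-up row in the same layout as the sample data, not part of the original file.

```shell
# Hypothetical one-line input: fields 3 and 4 are the alleles, 5 and 6 are 0/1 codes.
printf '%s\n' 'id 60479 C T 0 1' |
awk '{
    # $5 + 3 is a field number (3 or 4); $($5 + 3) is the content of that field.
    print $5 + 3, $($5 + 3), $6 + 3, $($6 + 3)
}'
# prints: 3 C 4 T
```

The same lookup is what the loop applies to every field from 5 up to NF.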
The output of your small test case is, as expected:
20:60479_C_T 60479 C T C C C C C T C T
20:60522_T_TC 60522 T TC T T T T T T T
20:60568_A_C 60568 A C A A C A A C
20:60571_C_A 60571 C A C A C A C C
20:60579_G_A 60579 G A G G A G G G
You will, of course, need to check the performance of this with your larger data sets. On my box, a three-million-line file with 3000 entries each takes about half an hour.
Compare that with a C program (although admittedly quick'n'dirty, heavily tied to your specific input data, without what I would generally consider necessary error checking) which only takes about ten minutes.
For completeness, here's the C variant which, assuming it's called prog.c, you can compile with something like gcc -o prog prog.c and run with something like ./prog <testprog.in:
#include <stdio.h>
#include <ctype.h>

static char buff[102040];

// Skip leading whitespace and return the start of the next field,
// storing its length in *pSz; return NULL when no field remains.
static char *getStr(char *buff, int *pSz) {
    if (*buff == 0) return NULL;
    char *nextBuff = buff;
    // Cast to unsigned char: isspace() is undefined for negative char values.
    while ((nextBuff[0] != 0) && isspace((unsigned char)nextBuff[0])) {
        nextBuff++;
    }
    if (*nextBuff == 0) return NULL;
    *pSz = 0;
    while ((nextBuff[*pSz] != 0) && ! isspace((unsigned char)nextBuff[*pSz])) {
        (*pSz)++;
    }
    return nextBuff;
}

int main(void) {
    char *str, *str3, *str4; int sz, sz3, sz4;
    while (fgets(buff, sizeof(buff), stdin) != NULL) {
        // Echo the first four fields, remembering where fields 3 and 4 are.
        str = getStr(buff, &sz); printf("%*.*s", sz, sz, str);
        str = getStr(str + sz, &sz); printf(" %*.*s", sz, sz, str);
        str3 = getStr(str + sz, &sz3); printf(" %*.*s", sz3, sz3, str3);
        str4 = getStr(str3 + sz3, &sz4); printf(" %*.*s", sz4, sz4, str4);
        // For every remaining field, print field 3 for 0, otherwise field 4.
        str = getStr(str4 + sz4, &sz);
        while (str != NULL) {
            if (*str == '0') {
                printf(" %*.*s", sz3, sz3, str3);
            } else {
                printf(" %*.*s", sz4, sz4, str4);
            }
            str = getStr(str + sz, &sz);
        }
        printf("\n");
    }
    return 0;
}
Upvotes: 0
Reputation: 203169
$ awk '{d[0]=$3; d[1]=$4; for (i=5; i<=NF; i++) $i=d[$i]} 1' file
20:60479_C_T 60479 C T C C C C C T C T
20:60522_T_TC 60522 T TC T T T T T T T
20:60568_A_C 60568 A C A A C A A C
20:60571_C_A 60571 C A C A C A C C
20:60579_G_A 60579 G A G G A G G G
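One way to read the one-liner is as the expanded sketch below, where the trailing 1 pattern is written out as an explicit print. The input row is a made-up variation of the second sample line (its 0/1 values are hypothetical), chosen so that both array entries get used.

```shell
# Hypothetical input row with a multi-character alternate allele (TC).
printf '%s\n' '20:60522_T_TC 60522 T TC 0 0 1' |
awk '{
    d[0] = $3                    # value substituted for 0
    d[1] = $4                    # value substituted for 1
    for (i = 5; i <= NF; i++)
        $i = d[$i]               # replace each 0/1 field via the lookup array
    print                        # same effect as the trailing 1 in the one-liner
}'
# prints: 20:60522_T_TC 60522 T TC T T TC
```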
Upvotes: 1
Reputation: 2091
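Using gsub in awk, you could try this as an option. Because gsub without a target replaces every 0 and 1 on the whole line, the first two fields are saved beforehand and restored afterwards: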
$ awk '{d[1]=$1;d[2]=$2;gsub(/0/,$3);gsub(/1/,$4);$1=d[1];$2=d[2];}1' myfile
20:60479_C_T 60479 C T C C C C C T C T
20:60522_T_TC 60522 T TC T T T T T T T
20:60568_A_C 60568 A C A A C A A C
20:60571_C_A 60571 C A C A C A C C
20:60579_G_A 60579 G A G G A G G G
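As a small illustration of why the first two fields need to be saved, the sketch below runs the two gsub calls without the save and restore steps. The input is a shortened, made-up row in the sample layout.

```shell
# Hypothetical shortened row; gsub with no target operates on the whole record ($0).
printf '%s\n' '20:60479_C_T 60479 C T 0 1' |
awk '{ gsub(/0/, $3); gsub(/1/, $4) } 1'
# prints: 2C:6C479_C_T 6C479 C T C T
```

The zeros inside the identifier and position fields are rewritten too, which is why this approach copies $1 and $2 into d[1] and d[2] first. It also relies on the allele fields themselves never containing 0 or 1.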
Upvotes: 0