AndraD
AndraD

Reputation: 2850

awk: how to extract from file A the columns with indexes specified in file B?

My file A has the format below, with hundreds of thousands of columns and thousands of rows:

1000->100001    DOSE    2.000   2.000   2.000   2.000   2.000 ....
1001->100101    DOSE    1.988   1.988   2.000   2.000   2.000 ....
1001->100101    DOSE    1.933   2.000   2.000   2.000   2.000 ....
1002->100201    DOSE    2.000   2.000   2.000   2.000   2.000 ....
1002->100201    DOSE    2.000   2.000   2.000   2.000   2.000 ....

While my file B (thousands of entries) has the following format:

SNP,Al1,Al2,Freq1,MAF,AvgCall,Rsq,Genotyped,key,pos,gene_key
20:29649365,C,T,0.93021,0.06979,0.93021,0.10115,,803428,29649365,12
20:29649737,A,G,0.93914,0.06086,0.93914,0.14303,,803442,29649737,12
20:29649765,T,G,0.99963,0.00037,0.99963,0.13918,,803443,29649765,12
20:29650462,A,T,0.89387,0.10613,0.89388,0.12864,,803456,29650462,12

What I want to do is extract from file A the first two columns plus the columns specified in the "key" column from file B. I spent some time trying to figure out how to do this using awk, but it's been taking way too long for something that theoretically should not be too difficult.

My specific question is: once I extract the "key" entries from file B, how do I pipe these values into the awk command for file A?

Manual command for the first 3 columns in file B:

awk '{print $1, $2, $803428, $803442, $803442}' fileA > output.txt 

EDITS:

There is no common column between file A and file B. The values in column 803428 in file A would represent results for the SNP "20:29649365" from file B.

If the command to be run were:

awk '{print $1, $2, $3, $4, $5}' fileA

The result would be:

1000->100001 DOSE 2.000 2.000 2.000 2.000 2.000
1001->100101 DOSE 1.988 1.988 2.000 2.000 2.000
1002->100201 DOSE 1.933 1.999 2.000 2.000 2.000
1003->100301 DOSE 2.000 2.000 2.000 2.000 2.000
1004->100401 DOSE 2.000 2.000 2.000 2.000 2.000

Specific examples I am using to test the awk commands given as solutions.

fileA:

1000->100001    DOSE    2.000   2.000   2.000   2.000   2.000
1001->100101    DOSE    1.988   1.988   2.000   2.000   2.000
1001->100101    DOSE    1.933   2.000   1.500   2.000   2.000
1002->100201    DOSE    2.000   2.000   2.000   2.000   1.622
1002->100201    DOSE    2.000   2.000   2.000   2.000   2.000

fileB:

SNP,Al1,Al2,Freq1,MAF,AvgCall,Rsq,Genotyped,key,pos,gene_key
20:29649365,C,T,0.93021,0.06979,0.93021,0.10115,,3,29649365,12
20:29649737,A,G,0.93914,0.06086,0.93914,0.14303,,4,29649737,12
20:29650462,A,T,0.89387,0.10613,0.89388,0.12864,,6,29650462,12

Desired output (columns 1, 2, 3, 4, 6 - last 3 from the key column of fileB):

1000->100001    DOSE    2.000   2.000   2.000
1001->100101    DOSE    1.988   1.988   2.000
1001->100101    DOSE    1.933   2.000   2.000
1002->100201    DOSE    2.000   2.000   2.000
1002->100201    DOSE    2.000   2.000   2.000

Upvotes: 3

Views: 888

Answers (3)

Ruud Helderman
Ruud Helderman

Reputation: 11028

Alternative approach: apply an awk script on fileB that generates another awk script, to be applied on fileA.

#!/bin/bash
awk -F, 'NR>1{a=a",$"$9}END{print"{print $1,$2"a"}"}' < fileB > cols.awk
awk -f cols.awk fileA

Or as a one-liner, without intermediate file:

#!/bin/bash
awk "$(awk -F, 'NR>1{a=a",$"$9}END{print"{print $1,$2"a"}"}' < fileB)" fileA

Upvotes: 1

Kent
Kent

Reputation: 195289

if you want to read the column index from fileB (the key column, not the pos) and print those columns from fileA, also keep the order of the col index in fileB, you can try:

awk 'NR==FNR{c[NR]=$(NF-2);n=NR;next}
{printf "%s %s",$1,$2;
for(i=2;i<=n;i++)printf " %s",$c[i];print ""}' FS=',' fileB FS=' ' fileA

Upvotes: 3

amaurea
amaurea

Reputation: 5067

Ok, here's an updated version which should reproduce your output.

awk 'ARGIND==2&&!/SNP/{cols[++i]=$9}ARGIND==4{printf("%s %s",$1,$2);
      for(j=1;j<=i;j++)printf(" %s%s",$cols[j],j<i?"":"\n");}' FS=',' B.txt FS='[ \t]+' A.txt 

It's complicated a bit by the different field separators in the two files and the header in the B-file that must be ignored. But the key here is that $ accepts a variable, not just a constant integer. This approach stores the list of columns in the cols array, and iterates through them for each row in A.

Switching field separators between files is done via dummy files FS=',' and 'FS='[ \t]+'`, which means that our actual files have arginds of 2 and 4.

Upvotes: 3

Related Questions