Reputation: 2850
My file A has the format below, with hundreds of thousands of columns and thousands of rows:
1000->100001 DOSE 2.000 2.000 2.000 2.000 2.000 ....
1001->100101 DOSE 1.988 1.988 2.000 2.000 2.000 ....
1001->100101 DOSE 1.933 2.000 2.000 2.000 2.000 ....
1002->100201 DOSE 2.000 2.000 2.000 2.000 2.000 ....
1002->100201 DOSE 2.000 2.000 2.000 2.000 2.000 ....
While my file B (thousands of entries) has the following format:
SNP,Al1,Al2,Freq1,MAF,AvgCall,Rsq,Genotyped,key,pos,gene_key
20:29649365,C,T,0.93021,0.06979,0.93021,0.10115,,803428,29649365,12
20:29649737,A,G,0.93914,0.06086,0.93914,0.14303,,803442,29649737,12
20:29649765,T,G,0.99963,0.00037,0.99963,0.13918,,803443,29649765,12
20:29650462,A,T,0.89387,0.10613,0.89388,0.12864,,803456,29650462,12
What I want to do is extract from file A the first two columns plus the columns specified in the "key" column from file B. I spent some time trying to figure out how to do this using awk, but it's been taking way too long for something that theoretically should not be too difficult.
My specific question is: once I extract the "key" entries from file B, how do I pipe these values into the awk command for file A?
Manual command for the first 3 columns in file B:
awk '{print $1, $2, $803428, $803442, $803442}' fileA > output.txt
EDITS:
There is no common column between file A and file B. The values in column 803428 in file A would represent results for the SNP "20:29649365" from file B.
If the command to be run were:
awk '{print $1, $2, $3, $4, $5}' fileA
The result would be:
1000->100001 DOSE 2.000 2.000 2.000 2.000 2.000
1001->100101 DOSE 1.988 1.988 2.000 2.000 2.000
1002->100201 DOSE 1.933 1.999 2.000 2.000 2.000
1003->100301 DOSE 2.000 2.000 2.000 2.000 2.000
1004->100401 DOSE 2.000 2.000 2.000 2.000 2.000
Specific examples I am using to test the awk commands given as solutions.
fileA:
1000->100001 DOSE 2.000 2.000 2.000 2.000 2.000
1001->100101 DOSE 1.988 1.988 2.000 2.000 2.000
1001->100101 DOSE 1.933 2.000 1.500 2.000 2.000
1002->100201 DOSE 2.000 2.000 2.000 2.000 1.622
1002->100201 DOSE 2.000 2.000 2.000 2.000 2.000
fileB:
SNP,Al1,Al2,Freq1,MAF,AvgCall,Rsq,Genotyped,key,pos,gene_key
20:29649365,C,T,0.93021,0.06979,0.93021,0.10115,,3,29649365,12
20:29649737,A,G,0.93914,0.06086,0.93914,0.14303,,4,29649737,12
20:29650462,A,T,0.89387,0.10613,0.89388,0.12864,,6,29650462,12
Desired output (columns 1, 2, 3, 4, 6 - last 3 from the key column of fileB):
1000->100001 DOSE 2.000 2.000 2.000
1001->100101 DOSE 1.988 1.988 2.000
1001->100101 DOSE 1.933 2.000 2.000
1002->100201 DOSE 2.000 2.000 2.000
1002->100201 DOSE 2.000 2.000 2.000
Upvotes: 3
Views: 888
Reputation: 11028
Alternative approach: apply an awk script on fileB that generates another awk script, to be applied on fileA.
#!/bin/bash
awk -F, 'NR>1{a=a",$"$9}END{print"{print $1,$2"a"}"}' < fileB > cols.awk
awk -f cols.awk fileA
Or as a one-liner, without intermediate file:
#!/bin/bash
awk "$(awk -F, 'NR>1{a=a",$"$9}END{print"{print $1,$2"a"}"}' < fileB)" fileA
Upvotes: 1
Reputation: 195289
if you want to read the column index from fileB (the key
column, not the pos
) and print those columns from fileA, also keep the order of the col index in fileB, you can try:
awk 'NR==FNR{c[NR]=$(NF-2);n=NR;next}
{printf "%s %s",$1,$2;
for(i=2;i<=n;i++)printf " %s",$c[i];print ""}' FS=',' fileB FS=' ' fileA
Upvotes: 3
Reputation: 5067
Ok, here's an updated version which should reproduce your output.
awk 'ARGIND==2&&!/SNP/{cols[++i]=$9}ARGIND==4{printf("%s %s",$1,$2);
for(j=1;j<=i;j++)printf(" %s%s",$cols[j],j<i?"":"\n");}' FS=',' B.txt FS='[ \t]+' A.txt
It's complicated a bit by the different field separators in the two files and the header in the B-file that must be ignored. But the key here is that $
accepts a variable, not just a constant integer. This approach stores the list of columns in the cols
array, and iterates through them for each row in A.
Switching field separators between files is done via dummy files FS=','
and 'FS='[ \t]+'`, which means that our actual files have arginds of 2 and 4.
Upvotes: 3