Reputation: 368
I have multiple files with content like:
==> file1.tab <==
Contig Position TS-568_ALLELE TS-568_FREQUENCY TS-568_COVERAGE
ch00 11009393 A/C 0.02/0.97 93
ch00 11009395 A/C/T 0.01/0.97/0.01 96
ch00 11009416 A/G/T 0.12/0.83/0.04 97
ch00 11009421 A/G 0.17/0.82 97
ch00 11009427 A/C/T 0.02/0.96/0.00 101
ch01 11009436 C/T 0.99/0.00 103
ch01 11009437 G/T 0.00/0.99 104
ch02 11009441 A/G 0.07/0.92 101
ch02 11009445 G 1.0 96
==> file2.tab <==
Contig Position TS-602_ALLELE TS-602_FREQUENCY TS-602_COVERAGE
ch00 11009393 C 1.0 7
ch00 11009395 C 1.0 7
ch00 11009416 G 1.0 5
ch00 11009421 G 1.0 5
ch00 11009427 C 1.0 4
ch01 11009436 C 1.0 4
ch01 11009437 T 1.0 4
ch02 11009441 G 1.0 5
ch02 11009445 G 1.0 5
==> file3.tab <==
Contig Position TS-586_ALLELE TS-586_FREQUENCY TS-586_COVERAGE
ch00 11009393 C 1.0 34
ch00 11009395 C 1.0 35
ch00 11009416 A/G/T 0.07/0.89/0.02 39
ch00 11009421 A/G 0.10/0.89 39
ch00 11009427 A/C 0.02/0.97 37
ch01 11009436 C 1.0 44
ch01 11009437 T 1.0 44
ch02 11009441 A/G 0.06/0.93 45
ch02 11009445 G 1.0 44
I read about join in bash, but my problem here is that I need to join them based on two first columns, not a single one (ch00 11009427 is a unique identifier, ch01 11009427, or ch03 11009427 can also be in the list).
So, the expected output will be:
==> file_all.tab <==
Contig Position TS-568_ALLELE TS-568_FREQUENCY TS-568_COVERAGE TS-602_ALLELE TS-602_FREQUENCY TS-602_COVERAGE TS-586_ALLELE TS-586_FREQUENCY TS-586_COVERAGE
ch00 11009393 A/C 0.02/0.97 93 C 1.0 7 C 1.0 34
ch00 11009395 A/C/T 0.01/0.97/0.01 96 C 1.0 7 C 1.0 35
ch00 11009416 A/G/T 0.12/0.83/0.04 97 G 1.0 5 A/G/T 0.07/0.89/0.02 39
ch00 11009421 A/G 0.17/0.82 97 G 1.0 5 A/G 0.10/0.89 39
ch00 11009427 A/C/T 0.02/0.96/0.00 101 C 1.0 4 A/C 0.02/0.97 37
ch01 11009436 C/T 0.99/0.00 103 C 1.0 4 C 1.0 44
ch01 11009437 G/T 0.00/0.99 104 T 1.0 4 T 1.0 44
ch02 11009441 A/G 0.07/0.92 101 G 1.0 5 A/G 0.06/0.93 45
ch02 11009445 G 1.0 96 G 1.0 5 G 1.0 44
The files are sorted based on column 1 and column 2 and the number of rows are equal among them. And there are about 150 of those files with the total size of 6.5 GB
Upvotes: 0
Views: 95
Reputation: 4043
awk
may achieve what you desire.
awk -v OFS='\t' 'BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"} {for(i=3;i<=NF;i++) a[$1 OFS $2]=a[$1 OFS $2] OFS $i}END{for(i in a)print i,a[i]}'
Brief explanation,
OFS='\t'
: set the output field separator as \t
PROCINFO["sorted_in"]="@ind_str_asc"
: Order the array by indices in ascending order compared as strings. Reference here for more details.for(i=3;i<=NF;i++) a[$1 OFS $2]=a[$1 OFS $2] OFS $i
: save the value to the specific indice in the array.END{for(i in a)print i,a[i]}
: print the indice and its corresponding value in the array.Upvotes: 1