How to join (merge) multiple files (by rows) using multiple columns

Question

I have multiple files with content like:

==> file1.tab <==
Contig  Position    TS-568_ALLELE   TS-568_FREQUENCY    TS-568_COVERAGE
ch00    11009393    A/C 0.02/0.97   93
ch00    11009395    A/C/T   0.01/0.97/0.01  96
ch00    11009416    A/G/T   0.12/0.83/0.04  97
ch00    11009421    A/G 0.17/0.82   97
ch00    11009427    A/C/T   0.02/0.96/0.00  101
ch01    11009436    C/T 0.99/0.00   103
ch01    11009437    G/T 0.00/0.99   104
ch02    11009441    A/G 0.07/0.92   101
ch02    11009445    G   1.0 96

==> file2.tab <==
Contig  Position    TS-602_ALLELE   TS-602_FREQUENCY    TS-602_COVERAGE
ch00    11009393    C   1.0 7
ch00    11009395    C   1.0 7
ch00    11009416    G   1.0 5
ch00    11009421    G   1.0 5
ch00    11009427    C   1.0 4
ch01    11009436    C   1.0 4
ch01    11009437    T   1.0 4
ch02    11009441    G   1.0 5
ch02    11009445    G   1.0 5

==> file3.tab <==
Contig  Position    TS-586_ALLELE   TS-586_FREQUENCY    TS-586_COVERAGE
ch00    11009393    C   1.0 34
ch00    11009395    C   1.0 35
ch00    11009416    A/G/T   0.07/0.89/0.02  39
ch00    11009421    A/G 0.10/0.89   39
ch00    11009427    A/C 0.02/0.97   37
ch01    11009436    C   1.0 44
ch01    11009437    T   1.0 44
ch02    11009441    A/G 0.06/0.93   45
ch02    11009445    G   1.0 44

I read about join in bash, but my problem here is that I need to join them based on two first columns, not a single one (ch00 11009427 is a unique identifier, ch01 11009427, or ch03 11009427 can also be in the list).

So, the expected output will be:

==> file_all.tab <==
Contig  Position    TS-568_ALLELE   TS-568_FREQUENCY    TS-568_COVERAGE TS-602_ALLELE   TS-602_FREQUENCY    TS-602_COVERAGE TS-586_ALLELE   TS-586_FREQUENCY    TS-586_COVERAGE
ch00    11009393    A/C 0.02/0.97   93  C   1.0 7   C   1.0 34
ch00    11009395    A/C/T   0.01/0.97/0.01  96  C   1.0 7   C   1.0 35
ch00    11009416    A/G/T   0.12/0.83/0.04  97  G   1.0 5   A/G/T   0.07/0.89/0.02  39
ch00    11009421    A/G 0.17/0.82   97  G   1.0 5   A/G 0.10/0.89   39
ch00    11009427    A/C/T   0.02/0.96/0.00  101 C   1.0 4   A/C 0.02/0.97   37
ch01    11009436    C/T 0.99/0.00   103 C   1.0 4   C   1.0 44
ch01    11009437    G/T 0.00/0.99   104 T   1.0 4   T   1.0 44
ch02    11009441    A/G 0.07/0.92   101 G   1.0 5   A/G 0.06/0.93   45
ch02    11009445    G   1.0 96  G   1.0 5   G   1.0 44

The files are sorted based on column 1 and column 2 and the number of rows are equal among them. And there are about 150 of those files with the total size of 6.5 GB

CWLiu · Accepted Answer

awk may achieve what you desire.

awk -v OFS='	' 'BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"} {for(i=3;i<=NF;i++) a[$1 OFS $2]=a[$1 OFS $2] OFS $i}END{for(i in a)print i,a[i]}'

Brief explanation,

OFS=' ': set the output field separator as
PROCINFO["sorted_in"]="@ind_str_asc": Order the array by indices in ascending order compared as strings. Reference here for more details.
for(i=3;i<=NF;i++) a[$1 OFS $2]=a[$1 OFS $2] OFS $i: save the value to the specific indice in the array.
END{for(i in a)print i,a[i]}: print the indice and its corresponding value in the array.

How to join (merge) multiple files (by rows) using multiple columns

Answers (1)

Related Questions