HKJ3
HKJ3

Reputation: 477

How to merge two files together matching exactly by 2 columns?

I have file 1 with 5778 lines with 15 columns.

Sample from output_armlympho.txt:

NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804
42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202 

I have another file with 25958247 lines and 16 columns

Sample from file1:

column1 column2 column3 column4 column5 column6 column7 column8 column9 column10    column11    column12    column13    column14    column15    column16
1 chr1_10000044_A_T_b38 ENS 171773 29 30 0.02 0.33 0.144 0.14 chr1 10000044 A T chr1 10060102 
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645

I want to merge these files together so that the second and third column from file 1 (CHROM POS) exactly matches the 15th and 16th columns from file 2 (column15 column16). However a problem is that in column15, the format is chr[number] e.g. chr1 and in file 1 is just 1. So I need a way to match 1 to chr1 or 7 to chr7 and via position. There may also be repeated lines in file 2. E.g. repeated values that are the same in column15 and column16. Both files aren't ordered in the same way.

Expected output: (outputs all the columns from file 1 and 2).

column1 column2 column3 column4 column5 column6 column7 column8 column9 column10    column11    column12    column13    column14    column15    column16 NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785 42486 7 18054785 42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202 
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638 42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124 
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645 42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804

Current attempt:

awk 'NR==FNR {Tmp[$3] = $16 FS $4; next} ($16 in Tmp) {print $0 FS Tmp[$16]}' output_armlympho.txt file1 > test

Upvotes: 0

Views: 69

Answers (1)

markp-fuso
markp-fuso

Reputation: 35266

Assumptions:

  • within the file output_armlympho.txt the combination of the 2nd and 3rd columns are unique

One awk idea:

awk '
FNR==1             { if (header) print $0,header; else header=$0; next }   
FNR==NR            { lines["chr" $2,$3]=$0; next }
($15,$16) in lines { print $0, lines[$15,$16] }
' output_armlympho.txt file1

This generates:

column1 column2 column3 column4 column5 column6 column7 column8 column9 column10    column11    column12    column13    column14    column15    column16 NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785 42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638 42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645 42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804

Upvotes: 0

Related Questions