Aron
Aron

Reputation: 89

Merging two files on a common column and printing all columns using awk

i have two very long space-separated files that look somehow like this:

file1:

CHR SNP   A1 A2 MAF
1    rs12  A G  0.43
1    rs1   A T  0.22
1    1:30  G A  0.012
1    rs23  G A  0.012

file2:

SNP    CHR A1 A2 MAF CHR:BP
rs21   1   G  A  0.03 1:30
rs13   1   T  A  0.06 1:122
rs23   1   A  G  0.02 1:234

I want to merge them together, when column 2 from file 1 either fits with column 1 from file 2 or with column 6 from file 2 and print all columns from both files.

so my output for the example should be: file 3

SNP  CHR A1 A2 MAF CHR:BP CHR SNP  A1 A2 MAF
rs21 1   G  A  0.03 1:30   1  1:30 G  A  0.012
rs23 1   A  G  0.02 1:234  1  rs23 G  A  0.012

i used the following code:

awk 'NR==FNR{a[$2]=$0;next} ($1 in a || $6 in a){print $0 FS a[$2]}' file1 file2 > file3

But for some reason it filters out the right lines but only prints the columns of file2 but not of file1.

I also tried with join, but it gives me weird results, even when the files are sorted. Weird meaning, that it ommits roughly 1/5 of the lines that actually are present in both files. But just for completion, this is the code I tried using join:

join -1 2 -2 1 -o '2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 1.1 1.2 1.3 1.4 1.5 1.6' <(sort -k2 file1) <(sort -k1 file2) > file3.1

join -1 2 -2 6 -o '2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 1.1 1.2 1.3 1.4 1.5 1.6' <(sort -k2 file1) <(sort -k6 file2) > file3.2

cat file3.1 file 3.2 > file3

Does anybody see my mistake? I would be very greatful. Thanks!

Upvotes: 3

Views: 1214

Answers (2)

RavinderSingh13
RavinderSingh13

Reputation: 133780

Could you please try following.

awk 'FNR==NR{a[$2]=$0;next} ($1 in a) || ($6 in a){print $0,a[$1]?a[$1]:a[$6]}'  Input_file1  Input_file2

Adding a non-one liner form of solution here too.

awk '
FNR==NR{
  a[$2]=$0
  next
}
($1 in a) || ($6 in a){
  print $0,a[$1]?a[$1]:a[$6]
}'  Input_file1   Input_file2

Append > file3 to above ones to get the output into file3.

EDIT: Adding solution for OP's code too now.

awk 'NR==FNR{a[$2]=$0;next} ($1 in a || $6 in a){print $0 FS a[$1]?a[$1]:a[$6]}' file1 file2 > file3

Upvotes: 2

ghoti
ghoti

Reputation: 46896

Your error is in your print command:

{print $0 FS a[$2]}

should be

{print $0, ( $1 in a ? a[$1] : a[$6] )}

because (1) the "interesting" field is only in $2 while you're in file1, not as you traverse file2, and (2) you need to pull data from the array based on either of your conditions. We can select which array element using a classic ternary operator. (I've added extra spacing for easier reading.)

Note the use of the comma, which separates fields by OFS, rather than using FS which is your input field separator.

Otherwise, your code looks fine to me.

Upvotes: 3

Related Questions