Reputation: 89
i have two very long space-separated files that look somehow like this:
file1:
CHR SNP A1 A2 MAF
1 rs12 A G 0.43
1 rs1 A T 0.22
1 1:30 G A 0.012
1 rs23 G A 0.012
file2:
SNP CHR A1 A2 MAF CHR:BP
rs21 1 G A 0.03 1:30
rs13 1 T A 0.06 1:122
rs23 1 A G 0.02 1:234
I want to merge them together, when column 2 from file 1 either fits with column 1 from file 2 or with column 6 from file 2 and print all columns from both files.
so my output for the example should be: file 3
SNP CHR A1 A2 MAF CHR:BP CHR SNP A1 A2 MAF
rs21 1 G A 0.03 1:30 1 1:30 G A 0.012
rs23 1 A G 0.02 1:234 1 rs23 G A 0.012
i used the following code:
awk 'NR==FNR{a[$2]=$0;next} ($1 in a || $6 in a){print $0 FS a[$2]}' file1 file2 > file3
But for some reason it filters out the right lines but only prints the columns of file2 but not of file1.
I also tried with join, but it gives me weird results, even when the files are sorted. Weird meaning, that it ommits roughly 1/5 of the lines that actually are present in both files. But just for completion, this is the code I tried using join:
join -1 2 -2 1 -o '2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 1.1 1.2 1.3 1.4 1.5 1.6' <(sort -k2 file1) <(sort -k1 file2) > file3.1
join -1 2 -2 6 -o '2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 1.1 1.2 1.3 1.4 1.5 1.6' <(sort -k2 file1) <(sort -k6 file2) > file3.2
cat file3.1 file 3.2 > file3
Does anybody see my mistake? I would be very greatful. Thanks!
Upvotes: 3
Views: 1214
Reputation: 133780
Could you please try following.
awk 'FNR==NR{a[$2]=$0;next} ($1 in a) || ($6 in a){print $0,a[$1]?a[$1]:a[$6]}' Input_file1 Input_file2
Adding a non-one liner form of solution here too.
awk '
FNR==NR{
a[$2]=$0
next
}
($1 in a) || ($6 in a){
print $0,a[$1]?a[$1]:a[$6]
}' Input_file1 Input_file2
Append > file3
to above ones to get the output into file3.
EDIT: Adding solution for OP's code too now.
awk 'NR==FNR{a[$2]=$0;next} ($1 in a || $6 in a){print $0 FS a[$1]?a[$1]:a[$6]}' file1 file2 > file3
Upvotes: 2
Reputation: 46896
Your error is in your print
command:
{print $0 FS a[$2]}
should be
{print $0, ( $1 in a ? a[$1] : a[$6] )}
because (1) the "interesting" field is only in $2
while you're in file1, not as you traverse file2
, and (2) you need to pull data from the array based on either of your conditions. We can select which array element using a classic ternary operator. (I've added extra spacing for easier reading.)
Note the use of the comma, which separates fields by OFS
, rather than using FS
which is your input field separator.
Otherwise, your code looks fine to me.
Upvotes: 3