Reputation: 9752
if I have the following two files:
00001 missing affected 0.0132 case
00002 missing not affected 0.042 control
00003 missing affected 0.12 case
00004 present case
00005 missing affected 0.001 control
00001 A 406 R case
00002 Q 612 B case
00004 N/A 102 case
00005 C control
How can I join the two tables by the first and last columns of each file so that I get something like:
00001 missing affected 0.0132 case A 406 R
00002 missing not affected 0.042
00003 missing affected 0.12
00004 present case N/A 102
00005 missing affected 0.001 control 00005 C
notice that 00003 in the second table is missing and line 2 in the second table disagrees with the case column in table 1, so this is effectively an SQL equivalent of a left join.
Thank you.
Upvotes: 1
Views: 414
Reputation: 784938
You can use this awk:
awk 'BEGIN{FS=OFS="\t"} {
k = $1 FS $NF
} FNR==NR {
gsub("^" $1 FS "|" FS $NF "$", "")
a[k] = $0
next
} {
$NF = (k in a) ? $NF OFS a[k] : ""
} 1' file2 file1 |
column -s $'\t' -t
column -s $'\t' -t
is used to format the output in tabular format.
00001 missing affected 0.0132 case A 406 R
00002 missing not affected 0.042
00003 missing affected 0.12
00004 present case N/A 102
00005 missing affected 0.001 control C
Upvotes: 1
Reputation: 203209
I don't understand exactly what the algorithm would be to get that output (e.g. why is 00005
the 2nd-last field in the last line of output?) and researching what an SQL equivalent of a left join
means is above my pay scale but hopefully this will get you close:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR { a[$1,$NF] = $0; next }
{
split(a[$1,$NF],b)
$NF = b[5] OFS b[2] OFS b[3] OFS b[4]
print
}
$ awk -f tst.awk file2 file1
00001 missing affected 0.0132 case A 406 R
00002 missing not affected 0.042
00003 missing affected 0.12
00004 present case N/A 102
00005 missing affected 0.001 control C
and just so you can see where the tabs line up in the above:
$ awk -f tst.awk file2 file1 | column -s$'\t' -t
00001 missing affected 0.0132 case A 406 R
00002 missing not affected 0.042
00003 missing affected 0.12
00004 present case N/A 102
00005 missing affected 0.001 control C
If you edit your question to include the pseudo-code of whatever algorithm you're hoping to implement then I'm sure someone can help you translate that to awk.
Also if a key field can be missing from file1 but present in file2 then include that case in your sample input/output so we can see how you want to handle that.
Upvotes: 2
Reputation: 10129
First change the field separator in your input to something different than spaces (because space is inside the not affected
). So lets assume the input is like this (merge1.csv):
00001;missing;affected;0.0132;case
00002;missing;not affected;0.042;control
00003;missing;affected;0.12;case
00004;present;;;case
00005;missing;affected;0.001;control
and (merge2.csv):
00001;A;406;R;case
00002;Q;612;B;case
00004;N/A;102; ;case
00005;C; ; ;control
Now
<merge1.csv awk -v FS=';' -v OFS=';' '{print $1 "-" $5 ";" $0}' | sort -k1,1
use two of the above together in a join command like this
join -t';' -j1 -e "" -a 1 -o 1.2,1.3,1.4,2.6,2.3,2.4,2.5 <( <merge1.csv awk -v FS=';' -v OFS=';' '{print $1 "-" $5 ";" $0}' | sort -k1,1 ) <( <merge2.csv awk -v FS=';' -v OFS=';' '{print $1 "-" $5 ";" $0}' | sort -k1,1 )
you want to read the 'join' man page on the options -a
-e
and -o
Upvotes: 2