brucezepplin
brucezepplin

Reputation: 9752

how to correctly merge tab delimited files

if I have the following two files:

00001    missing     affected          0.0132    case
00002    missing     not affected      0.042     control
00003    missing     affected          0.12      case
00004    present                                 case
00005    missing     affected          0.001     control

00001    A     406      R     case
00002    Q     612      B     case
00004    N/A   102            case                    
00005    C                    control

How can I join the two tables by the first and last columns of each file so that I get something like:

00001    missing     affected          0.0132    case    A     406      R
00002    missing     not affected      0.042     
00003    missing     affected          0.12
00004    present                                 case    N/A   102                              
00005    missing     affected          0.001     control       00005    C                    

notice that 00003 in the second table is missing and line 2 in the second table disagrees with the case column in table 1, so this is effectively an SQL equivalent of a left join.

Thank you.

Upvotes: 1

Views: 414

Answers (3)

anubhava
anubhava

Reputation: 784938

You can use this awk:

awk 'BEGIN{FS=OFS="\t"} {
   k = $1 FS $NF
} FNR==NR {
   gsub("^" $1 FS "|" FS $NF "$", "")
   a[k] = $0
   next
} {
   $NF = (k in a) ? $NF OFS a[k] : ""
} 1' file2 file1 |
column -s $'\t' -t

column -s $'\t' -t is used to format the output in tabular format.

00001  missing  affected      0.0132  case     A    406  R
00002  missing  not affected  0.042
00003  missing  affected      0.12
00004  present                        case     N/A  102
00005  missing  affected      0.001   control  C

Upvotes: 1

Ed Morton
Ed Morton

Reputation: 203209

I don't understand exactly what the algorithm would be to get that output (e.g. why is 00005 the 2nd-last field in the last line of output?) and researching what an SQL equivalent of a left join means is above my pay scale but hopefully this will get you close:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR { a[$1,$NF] = $0; next }
{
    split(a[$1,$NF],b)
    $NF = b[5] OFS b[2] OFS b[3] OFS b[4]
    print
}

$ awk -f tst.awk file2 file1
00001   missing affected        0.0132  case    A       406     R
00002   missing not affected    0.042
00003   missing affected        0.12
00004   present                 case    N/A     102
00005   missing affected        0.001   control C

and just so you can see where the tabs line up in the above:

$ awk -f tst.awk file2 file1 | column -s$'\t' -t
00001  missing  affected      0.0132  case     A    406  R
00002  missing  not affected  0.042
00003  missing  affected      0.12
00004  present                        case     N/A  102
00005  missing  affected      0.001   control  C

If you edit your question to include the pseudo-code of whatever algorithm you're hoping to implement then I'm sure someone can help you translate that to awk.

Also if a key field can be missing from file1 but present in file2 then include that case in your sample input/output so we can see how you want to handle that.

Upvotes: 2

Lars Fischer
Lars Fischer

Reputation: 10129

First change the field separator in your input to something different than spaces (because space is inside the not affected). So lets assume the input is like this (merge1.csv):

00001;missing;affected;0.0132;case
00002;missing;not affected;0.042;control
00003;missing;affected;0.12;case
00004;present;;;case
00005;missing;affected;0.001;control

and (merge2.csv):

00001;A;406;R;case
00002;Q;612;B;case
00004;N/A;102; ;case
00005;C; ; ;control

Now

  1. create a sort key out of col 1 and 5 and sorts on it with something like <merge1.csv awk -v FS=';' -v OFS=';' '{print $1 "-" $5 ";" $0}' | sort -k1,1
  2. use two of the above together in a join command like this

    join -t';' -j1 -e "" -a 1  -o 1.2,1.3,1.4,2.6,2.3,2.4,2.5 <( <merge1.csv awk -v FS=';' -v OFS=';' '{print $1 "-" $5 ";"  $0}' | sort -k1,1 ) <( <merge2.csv awk -v FS=';' -v OFS=';' '{print $1 "-" $5 ";"  $0}' | sort -k1,1 )
    

    you want to read the 'join' man page on the options -a -e and -o

  3. This gives more or less your desired output, except the last two columns in the last row.

Upvotes: 2

Related Questions