Manuel Ramón
Manuel Ramón

Reputation: 2498

Merge two data sets using AWK

I have 2 data files: file01 and file02. In both data sets fields are: (i) an identificator; (ii) a numeric reference; (iii) longitude; and (iv) latitude. For each row in file01, I want to seach the data in file02 with the same numeric reference and then, find the identificator in file02 which is nearest to the identificator in file01.

I can get this if I pass manually the values from file01 to the awk program using the following code:

awk 'function acos(x) { return atan2(sqrt(1-x*x), x) }
BEGIN {pi=3.14159;
       ndist=999999999.1;
       date=1001;
       lo1=-1.20; lg1=lo1*(pi/180);
       la1=30.31; lt1=la1*(pi/180)
           }
{if($2==date) {ws=$1;
               lg2=$3*(pi/180);
               lt2=$4*(pi/180);
               dist= 6378.7 * acos( sin(lt1)*sin(lt2) + cos(lt1)*cos(lt2)*cos(lg2-lg1) );
               if(dist < ndist) {ndist=dist; ws0=ws}}}
END {print(ws0,ndist)}' file02

As you see, date, lo1 and la1 in the BEGIN statement are the values in the 1st row of file01 (see below for data files). My question is if I could do that at once, so each time I read a row in file01, I get the nearest identificator and the distance and append to the row data in file01. I do not know if some shell command could achieve this in a easier way, maybe using a pipe.

An example of these two data files and the desired output are:

=== file01 ===

A 1001 -1.2 30.31
A 1002 -1.2 30.31
B 1002 -1.8 30.82
B 1003 -1.8 30.82
C 1001 -2.1 28.55

=== file02 ===

ws1 1000 -1.3 29.01
ws1 1001 -1.3 29.01
ws1 1002 -1.3 29.01
ws1 1003 -1.3 29.01
ws1 1004 -1.3 29.01
ws1 1005 -1.3 29.01
ws2 1000 -1.5 30.12
ws2 1002 -1.5 30.12
ws2 1003 -1.5 30.12
ws2 1004 -1.5 30.12
ws2 1005 -1.5 30.12
ws3 1000 -1.7 29.55
ws3 1001 -1.7 29.55
ws3 1002 -1.7 29.55
ws3 1003 -1.7 29.55
ws3 1004 -1.7 29.55
ws3 1005 -1.7 29.55
ws4 1000 -1.9 30.33
ws4 1001 -1.9 30.33
ws4 1002 -1.9 30.33
ws4 1003 -1.9 30.33
ws4 1004 -1.9 30.33
ws4 1005 -1.9 30.33

=== output file ===

A 1001 -1.2 30.31 ws4 67.308
A 1002 -1.2 30.31 ws2 35.783
B 1002 -1.8 30.82 ws4 55.387
B 1003 -1.8 30.82 ws4 55.387
C 1001 -2.1 28.55 ws1 85.369

EDIT #1: Considering the suggestion by @Eran, I wrote the following code:

join -j 2 < (sort -k 2,2 file01) < (sort -k 2,2 file02) |
awk 'function acos(x) { return atan2(sqrt(1-x*x), x) }
     BEGIN {pi=3.14159}

     {if (last != $1 $2)
         {print NR, id,r,lon,lat,ws0,ndist;
          last = $1 $2;
          ndist=999999999.1

         } else {

          lg1=$3*(pi/180);
          lt1=$4*(pi/180);
          lg2=$6*(pi/180);
          lt2=$7*(pi/180);
          dist= 6378.7 * acos( sin(lt1)*sin(lt2) + cos(lt1)*cos(lt2)*cos(lg2-lg1) );
          if(dist< ndist) {ndist=dist; ws0=$5}
          id=$2;r=$1;lon=$3;lat=$4

          }
     }'

The output from this script is:

1      
4  A 1001 -1.2 30.31 ws4 67.3078
7  C 1001 -2.0 28.55 ws3 115.094
11 A 1002 -1.2 30.31 ws2 35.7827
15 B 1002 -1.8 30.82 ws4 55.387

EDIT #2: Using athe suggestion of @Dennis (with some modifications) I have got the desired output. The awk script is as follows:


awk 'function acos(x) { return atan2(sqrt(1-x*x), x) }
     BEGIN {pi=3.14159}
     NR==FNR {c++; a1[c]=$1;a2[c]=$2;a3[c]=$3;a4[c]=$4; next}
             {d++; b1[d]=$1;b2[d]=$2;b3[d]=$3;b4[d]=$4}

     END {
     for(k=1;k<=c;k++) {
         lg1=a3[k]*(pi/180);
         lt1=a4[k]*(pi/180);
         ndist=999999999.1;
         for(l=1;l<=d;l++) {
             if(b2[l]==a2[k]) {kk=b2[l];
                lg2=b3[l]*(pi/180);
                lt2=b4[l]*(pi/180);
                dist= 6378.7 * acos( sin(lt1)*sin(lt2) + cos(lt1)*cos(lt2)*cos(lg2-lg1) );
                if(dist<ndist) {ndist=dist; ws0=b1[l]}
             }
         }
         print a1[k],a2[k],a3[k],a4[k],ws0,ndist
     }
    }' file01 file02

Upvotes: 2

Views: 831

Answers (4)

Kaz
Kaz

Reputation: 58617

TXR:

@(do
   (defvar pi 3.1415926535)
   (defvar earth-radius 6378.7)
   (defun rad (deg) (/ (* deg pi) 180))
   (defun sphere-distance (lat0 lon0 lat1 lon1)
     (let ((lat0 (rad lat0)) (lat1 (rad lat1))
           (lon0 (rad lon0)) (lon1 (rad lon1)))
       (* earth-radius (acos (+ (* (sin lat0) (sin lat1))
                                (* (cos lat0) (cos lat1) (cos (- lon1 lon0)))))))))
@(next "file01")
@(collect)
@id @ref @lon0 @lat0
@  (filter :tonumber lon0 lat0)
@  (next "file02")
@  (bind (min-dist ws) (1e99 nil))
@  (collect)
@ws1 @ref @lon1 @lat1
@    (filter :tonumber lon1 lat1)
@    (do (let ((d (sphere-distance lat0 lon0 lat1 lon1)))
           (cond ((< d min-dist)
                  (set min-dist d)
                  (set ws ws1)))))
@  (end)
@  (do (format t "~a ~a ~0,2f ~0,2f ~a ~0,3f\n" id ref lon0 lat0 ws min-dist))
@(end)

Run:

$ txr dist.txr
A 1001 -1.20 30.31 ws4 67.308
A 1002 -1.20 30.31 ws2 35.783
B 1002 -1.80 30.82 ws4 55.387
B 1003 -1.80 30.82 ws4 55.387
C 1001 -2.10 28.55 ws1 93.361

Upvotes: 0

Eran Ben-Natan
Eran Ben-Natan

Reputation: 2615

To do it at once, run

join -j 2 <(sort -k 2,2 file01) <(sort -k 2,2 file02)

And pipe it to a awk which on every change in reference do the calc:

gawk '{if (last != $1 $2) {calc_nearest_on_array; last=$1 $2; add_point_to_array} else {add_point_to_array}}'

Upvotes: 1

glenn jackman
glenn jackman

Reputation: 247022

Interesting challenge. Since you have to read in file02 first and store in info in a data structure, I'd lean first to Perl.

#!/usr/bin/perl
use strict;
use warnings;

# see http://perldoc.perl.org/Math/Trig.html
use Math::Trig qw(great_circle_distance deg2rad);
sub NESW {deg2rad($_[0]), deg2rad(90-$_[1])}

# read file02
my %data;
my $file2 = 'file02';
open my $fid, '<', $file2 or die "can't read $file2: $!\n";
while (<$fid>) {
    my ($id, $ref, $long, $lat) = split;
    push @{$data{$ref}}, [$id, $long, $lat];
}
close $fid;

$, = " ";

# process file01
my $file1 = 'file01';
open $fid, '<', $file1 or die "can't read $file1: $!\n";
while (<$fid>) {
    my ($id, $ref, $long, $lat) = split;
    my @here = NESW($long, $lat);
    my $min = 99_999_999;
    my (@min_id, $dist);

    while (my ($key, $listref) = each %data) {
        next unless $key == $ref;

        foreach my $trioref (@$listref) {
            my ($other_id, $other_long, $other_lat) = @$trioref;
            my @there = NESW($other_long, $other_lat);
            $dist = great_circle_distance(@here, @there, 6378.7);
            if ($dist < $min) {
                $min = $dist;
                @min_id = @$trioref;
            }
        }
    }

    printf "%s %d %s %s %s %6.3f\n", $id, $ref, $long, $lat, $min_id, $min;
}
close $fid;

This outputs

A 1001 -1.2 30.31 ws4 67.308
A 1002 -1.2 30.31 ws2 35.783
B 1002 -1.8 30.82 ws4 55.387
B 1003 -1.8 30.82 ws4 55.387
C 1001 -2.1 28.55 ws1 93.361

I notice the "C" distance is different from what you suggest it should be.

Upvotes: 1

Dennis Williamson
Dennis Williamson

Reputation: 360345

Read your values from file01 into one or more arrays. You can use getline in the BEGIN block or the canonical way is to use a FNR == NR loop as one of the main blocks.

FNR == NR {array[$1] = $1; ...; next } # read file01 into some arrays
{ for (item in array) { ... }     # process each item in the array(s) against each line in file02

Your script would be invoked as awk '...' file01 file02

Instead of indexing the arrays by field values, you could index them with a counter array1[c] = $1; array2[c] = $2; c++ and iterate with a counter instead of using in: for (i=0; i<c; i++).

Of course, you should choose meaningful array names.

Upvotes: 2

Related Questions