mafugu
mafugu

Reputation: 49

Calculating distances with awk

I have text files as shown below.

 CA     21.660  -6.795  11.323 
  C     28.811  -9.801  16.262
  O     23.221  -9.266  13.799
  CB    33.528 -11.934  17.900
  N     21.660  -6.795  11.323
  O     32.410  -8.539  16.566

I want to calculate the distances between the coordinates of atoms. For example, I would like to read all files in a folder and calculate the distances between first and second atom,first and third, first and fourth, etc.Then, second and third, second and fourth, second and fifth etc.The formula is SQRT ((X1-X2)^2 + (Y1-Y2)^2 + (Z1-Z2)^2). I would like to save the outputs from each file to another folder with the names of input files. How can I do this with awk?

desired output

CA-C  4.52
CA-O  3.80
CA-CB 5.68
CA-N  8.94
--
--
--
N-O   5.98 

your help would be appreciated!!

Upvotes: 2

Views: 1462

Answers (5)

Chris Charley
Chris Charley

Reputation: 6613

A Perl solution could be:

#!/usr/bin/perl
use strict;
use warnings;

my @data = map [split], <DATA>;

for (my $i = 0; $i < @data; $i++) {
    for (my $j = $i+1; $j < @data; $j++) {
        my $d = distance( @data[$i, $j]);
        printf("%-6s%7.4f\n",
            join("-", map $_->[0], @data[$i, $j]), $d) if $d <= 6;
    }
}

sub distance {
    my ($coord1, $coord2) = @_;
    my $sum;
    for my $k (1 .. $#$coord1) {
        $sum += ($coord1->[$k] - $coord2->[$k])**2;
    }
    return sqrt $sum;
}

__DATA__
CA     21.660  -6.795  11.323 
C     28.811  -9.801  16.262
O     23.221  -9.266  13.799
CB    33.528 -11.934  17.900
N     21.660  -6.795  11.323
O     32.410  -8.539  16.566

Upvotes: 0

Vijay
Vijay

Reputation: 67291

Below is the code in awk:

awk '{a[NR]=$0}
     END
     { 
       for(i=1;i<=NR;i++)
        {split(a[i],k);
         for(j=i+1;j<=NR;j++)
          {
           split(a[j],l);
           dist=(k[2]-l[2])*(k[2]-l[2])+(k[3]-l[3])*(k[3]-l[3])+(k[4]-l[4])*(k[4]-l[4]);
           print k[1]"-"l[1],sqrt(dist);
          }
        }
     }' your_file

And Below is the Test:

> cat temp
 CA     21.660  -6.795  11.323 
  C     28.811  -9.801  16.262
  O     23.221  -9.266  13.799
  CB    33.528 -11.934  17.900
  N     21.660  -6.795  11.323
  O     32.410  -8.539  16.566

Execution:

> awk '{a[NR]=$0}END{for(i=1;i<=NR;i++){split(a[i],k);for(j=i+1;j<=NR;j++){split(a[j],l);dist=(k[2]-l[2])*(k[2]-l[2])+(k[3]-l[3])*(k[3]-l[3])+(k[4]-l[4])*(k[4]-l[4]);print k[1]"-"l[1],sqrt(dist);}}}' temp
CA-C 9.19601
CA-O 3.83055
CA-CB 14.5092
CA-N 0
CA-O 12.0869
C-O 6.13194
C-CB 5.42981
C-N 9.19601
C-O 3.82595
O-CB 11.4092
O-N 3.83055
O-O 9.62406
CB-N 14.5092
CB-O 3.81517
N-O 12.0869
> 

Upvotes: 1

Steve
Steve

Reputation: 54512

Here's one way using GNU awk:

awk 'FNR==NR { a[NR]=$0; next } { for (i=FNR+1;i<=NR-1;i++) { split(a[i],b); print $1 "-" b[1], sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2) | "column -t" } NR--}' file file

It does what you want, but either the algorithm you've provided is different from what you require, or, your expected output has been incorrectly calculated (I'm assuming the latter is the problem). Anyway, here's the results:

CA-C   9.19601
CA-O   3.83055
CA-CB  14.5092
CA-N   0
CA-O   12.0869
C-O    6.13194
C-CB   5.42981
C-N    9.19601
C-O    3.82595
O-CB   11.4092
O-N    3.83055
O-O    9.62406
CB-N   14.5092
CB-O   3.81517
N-O    12.0869

If you need to perform this on multiple files in the present working directory, and assuming there's nothing but files of interest in this directory, you can wrap a for loop around the awk statement. Obviously, you'll need to change /path/to/folder/ to your path of choice for it to work correctly:

for i in *; do awk 'FNR==NR { a[NR]=$0; next } { for (i=FNR+1;i<=NR-1;i++) { split(a[i],b); print $1 "-" b[1], sqrt(($2-b[2])^2 + ($3-b[3])^2 + ($4-b[4])^2) | "column -t > /path/to/folder/" FILENAME } NR--}' "$i"{,}; done

Upvotes: 4

colemar
colemar

Reputation: 125

If atoms is a file containing your data

awk '{ p[NR,0]=$1;p[NR,1]=$2;p[NR,2]=$3;p[NR,3]=$4; for (j=1;j<=NR-1;j++) print p[j,0]"-"$1,sqrt((p[NR,1]-p[j,1])^2+(p[NR,2]-p[j,2])^2+(p[NR,3]-p[j,3])^2) }' atoms 
CA-C 9.19601
CA-O 3.83055
C-O 6.13194
CA-CB 14.5092
C-CB 5.42981
O-CB 11.4092
CA-N 0
C-N 9.19601
O-N 3.83055
CB-N 14.5092
CA-O 12.0869
C-O 3.82595
O-O 9.62406
CB-O 3.81517
N-O 12.0869

There is a problem:

  • your data contains two identical O atoms, therefore it is hard to say which is which in the output

Upvotes: 3

Ed Morton
Ed Morton

Reputation: 204015

Something like this sounds like what you want, but clearly neither of the results match what you say they should so clarify your algorithm:

$ awk 'NR>1{print p[1]"-"$1,sqrt((p[2]-$2)^2 + (p[3]-$3)^2 + (p[4]-$4)^2)} {split($0,p) }' file
CA-C 9.19601
C-O 6.13194
O-CB 11.4092
CB-N 14.5092
N-O 12.0869

$ awk 'NR>1{print p[1]"-"$1,sqrt(($2-p[2])^2 + ($3-p[3])^2 + ($4-p[4])^2)} {split($0,p) }' file
CA-C 9.19601
C-O 6.13194
O-CB 11.4092
CB-N 14.5092
N-O 12.0869

Upvotes: 1

Related Questions