heuristicus
heuristicus

Reputation: 1138

In a column of numbers, find the closest value to some target value

Let's say I have some numerical data in columns, something like

11.100000 36.829657 6.101642
11.400000 36.402069 5.731998
11.700000 35.953025 5.372652
12.000000 35.482082 5.023737
12.300000 34.988528 4.685519
12.600000 34.471490 4.358360
12.900000 33.930061 4.042693
13.200000 33.363428 3.738985
13.500000 32.770990 3.447709
13.800000 32.152473 3.169312

I also have a single target value and a column index. Given this set of data, I want to find the closest value to the target value in the column with the specified index.

For example, If my target value is 11.6 in column 1, then the script should output 11.7. If there are two numbers equidistant from the target value, then the higher value should be output.

I have a feeling that awk has the necessary functionality to do this, but any solution that works in a bash script is welcome.

Upvotes: 3

Views: 5113

Answers (3)

Kent
Kent

Reputation: 195039

try this:

awk -v c=2 -v t=35 'NR==1{d=$c-t;d=d<0?-d:d;v=$c;next}{m=$c-t;m=m<0?-m:m}m<d{d=m;v=$c}END{print v}' file

the -v c=2 and -v t=35 could be dynamic value. they are the column idx (c) and your target value (t). in the above line, the parameter is column 2 and target 25. They could be shell variable.

the output of above line based on given input data is:

kent$  awk -v c=2 -v t=35 'NR==1{d=$c-t;d=d<0?-d:d;v=$c;next}{m=$c-t;m=m<0?-m:m}m<d{d=m;v=$c}END{print v}' f
34.988528

kent$  awk -v c=1 -v t=11.6 'NR==1{d=$c-t;d=d<0?-d:d;v=$c;next}{m=$c-t;m=m<0?-m:m}m<d{d=m;v=$c}END{print v}' f
11.700000

EDIT

If there are two numbers equidistant from the target value, then the higher value should be output

The above codes didn't check this requirement.... the below one should work:

awk -v c=1 -v t=11.6 '{a[NR]=$c}END{
        asort(a);d=a[NR]-t;d=d<0?-d:d;v = a[NR]
        for(i=NR-1;i>=1;i--){
                m=a[i]-t;m=m<0?-m:m
                if(m<d){
                    d=m;v=a[i]
                }
        }
        print v
}' file

test:

kent$  awk -v c=1 -v t=11.6 '{a[NR]=$c}END{
        asort(a);d=a[NR]-t;d=d<0?-d:d;v = a[NR]
        for(i=NR-1;i>=1;i--){
                m=a[i]-t;m=m<0?-m:m
                if(m<d){
                    d=m;v=a[i]
                }
        }
        print v
}' f
11.700000

short explanation.

I won't explain each line of code, what it does. just tell a bit the idea to do the job.

  • first read all element in the given column, save in an array
  • sort the array.
  • take the last element from the array(the greatest number). assign it to var v, and calculate the diff between it and the given target, save it(absolute value) in d
  • from the 2nd last element from the array loop to the first. if diff between element and target (absolute value) is less than d, overwrite d with diff, also save current element into v
  • print v, after looping, v is the answer.

some note:

  • there is room to optimize the logic. e.g. we don't have to loop thru the whole array. just compare the d(abs), if new diff > d, we can stop the loop.
  • due to the sort, this algorithm is O(nlogn). in fact this problem could be solved by O(n). If your input data were huge, and with a worst case(e.g. your column has value in range 500-99999999999, but your target is 1.) you may want to avoid the sort. but I assume the performance is not an issue by you.

Upvotes: 9

fedorqui
fedorqui

Reputation: 289555

Let's try another way, although Kent's answer must be shorter and sharper :)

awk -vc=1 -vv=13.6 '
    BEGIN{l=$c; ld=99}
    {d=($c-v>=0) ? ($c-v) : v-$c; if (d <= ld) {ld=d; l=$c}}
    END{print l}' file

We provide the c (=column) and v (=value) parameters in the beginning.

Then we keep track of the lower value l and the lowest distance ld. For each value we calculate the distance d to the value and if it is lower to the previous ld, we swap and save the new minimal value in l. Finally we print l.

The d=($c-v>=0) ? ($c-v) : v-$c is a way to save the distance as a absolute value: if c-v is negative, save it as positive. It is based on the value=(condition) ? if yes : else structure.

Tests

$ awk -vc=2 -vv=13.6 'BEGIN{l=$c; ld=99} {d=($c-v>=0) ? ($c-v) : v-$c; if (d <= ld) {ld=d; l=$c}} END{print l}' file
32.152473
$ awk -vc=3 -vv=10.6 'BEGIN{l=$c; ld=99} {d=($c-v>=0) ? ($c-v) : v-$c; if (d <= ld) {ld=d; l=$c}} END{print l}' file
3.169312

Upvotes: 2

choroba
choroba

Reputation: 241828

Perl solution:

#!/usr/bin/perl
use warnings;
use strict;

@ARGV == 2 or die "Usage: closest column value < input\n";
my ($column, $target) = (shift, shift);
my $closest;
while (<>) {
    my $value = (split)[$column - 1];
    if ($. == 1
        or abs($closest - $target) >  abs($target - $value)
        or abs($closest - $target) == abs($target - $value)
           && $value > $closest) {
        $closest = $value;
    }
}
print $closest, "\n";

Note that using float == float might not work (What Every Computer Scientist Should Know About Floatin-Point Arithmetic). You might need something like abs(abs($closest - $target) - abs($target - $value)) < 1e-14.

Upvotes: 1

Related Questions