Ioannis Pappas
Ioannis Pappas

Reputation: 817

Column Operations in file Linux Shell

I have a file with entries seperated by an empty space. For example:

example.txt

24676 256 218503341 2173
13236272 500 1023073758 5089
2230304 96 15622969 705
0 22 0 526
13277 28 379182 141

I would like to print, in the command line, the outcome of "column 1/ column 3" or simila. I believe it can be done with awk. However, some entries are 0, hence division by 0 gives:

fatal: division by zero attempted

In a more advanced case, I would like to find the median value (or some percentile) of the division.

Upvotes: 1

Views: 4123

Answers (2)

Jonathan Leffler
Jonathan Leffler

Reputation: 753585

There are many ways to ignore the row with a zero divisor, including:

awk '$3 != 0 { print $1/$3 }' your-data-file

awk '{ if ($3 != 0) print $1/$3 }' your-data-file

The question changed — to print 0 instead. The answer is not much harder:

awk '{ if ($3 != 0) print $1/$3; else print 0 }' your-data-file

Medians and other percentiles are much fiddlier to deal with. It's easiest if the data is in sorted order. So much easier that I'd expect to use a numeric sort and then process the data from there.


I dug out an old shell script which computes descriptive statistics - min, max, mode, median, and deciles of a single numeric column of data:

:   "@(#)$Id: dstats.sh,v 1.2 1997/06/02 21:45:00 johnl Exp $"
#
#   Calculate Descriptive Statistics: min, max, median, mode, deciles

sort -n $* |
awk 'BEGIN { max = -999999999; min = 999999999; }
    {   # Accumulate basic data
        count[$1]++;
        item[++n] = $1;
        if ($1 > max) max = $1;
        if ($1 < min) min = $1;
    }
END {   # Print Descriptive Statistics
        printf("# Count = %d\n", n);
        printf("# Min = %d\n", min);
        decile = 1;
        for (decile = 10; decile < 100; decile += 10)
        {
            idx = int((decile * n) / 100) + 1;
            printf("# %d%% decile = %d\n", decile, item[idx]);
            if (decile == 50)
                median = item[idx];
        }
        printf("# Max = %d\n", max);

        printf("# Median = %d\n", median);
        for (i in count)
        {
            if (count[i] > count[mode])
                mode = i;
        }
        printf("# Mode = %d\n", mode);
    }'

The initial values of min and max are not exactly scientific. It serves to illustrate a point.

(This 1997 version is almost identical to its 1991 predecessor - all except for the version information line is identical, in fact. So, the code is over 20 years old.)

Upvotes: 3

Adam Liss
Adam Liss

Reputation: 48290

Here's one solution:

awk '
  $3 != 0 { vals[$NR]=$1/$3; sum += vals[$NR]; print vals[$NR] }
  $3 == 0 { vals[$NR]=0; print "skipping division by 0" }
  END { sort vals; print "Mean = " sum/$NR ", Median ~ " vals[$NR/2] }
  ' < your_file

This will calculate, print, and accumulate the quotients if the 3rd column is not zero. When it reaches the end of your file (which should not have an empty line), it will print the mean and median of all the quotients, assuming 0 for each line in which it would have divided by zero.

In awk, $n means the nth field, starting with 1, and $NR means the number of records (that is, the number of lines) that have been processed. Each quotient is stored in the array vals, enabling us to calculate the median value.

In real life, the median is defined as the "middle" item given an odd number of elements, or the mean of the two "middle" items given an even number of elements.

And you're on your own when it comes to implementing the sort function!

Upvotes: 2

Related Questions