Reputation: 817
I have a file with entries seperated by an empty space. For example:
example.txt
24676 256 218503341 2173
13236272 500 1023073758 5089
2230304 96 15622969 705
0 22 0 526
13277 28 379182 141
I would like to print, in the command line, the outcome of "column 1/ column 3" or simila. I believe it can be done with awk. However, some entries are 0, hence division by 0 gives:
fatal: division by zero attempted
In a more advanced case, I would like to find the median value (or some percentile) of the division.
Upvotes: 1
Views: 4123
Reputation: 753585
There are many ways to ignore the row with a zero divisor, including:
awk '$3 != 0 { print $1/$3 }' your-data-file
awk '{ if ($3 != 0) print $1/$3 }' your-data-file
The question changed — to print 0 instead. The answer is not much harder:
awk '{ if ($3 != 0) print $1/$3; else print 0 }' your-data-file
Medians and other percentiles are much fiddlier to deal with. It's easiest if the data is in sorted order. So much easier that I'd expect to use a numeric sort and then process the data from there.
I dug out an old shell script which computes descriptive statistics - min, max, mode, median, and deciles of a single numeric column of data:
: "@(#)$Id: dstats.sh,v 1.2 1997/06/02 21:45:00 johnl Exp $"
#
# Calculate Descriptive Statistics: min, max, median, mode, deciles
sort -n $* |
awk 'BEGIN { max = -999999999; min = 999999999; }
{ # Accumulate basic data
count[$1]++;
item[++n] = $1;
if ($1 > max) max = $1;
if ($1 < min) min = $1;
}
END { # Print Descriptive Statistics
printf("# Count = %d\n", n);
printf("# Min = %d\n", min);
decile = 1;
for (decile = 10; decile < 100; decile += 10)
{
idx = int((decile * n) / 100) + 1;
printf("# %d%% decile = %d\n", decile, item[idx]);
if (decile == 50)
median = item[idx];
}
printf("# Max = %d\n", max);
printf("# Median = %d\n", median);
for (i in count)
{
if (count[i] > count[mode])
mode = i;
}
printf("# Mode = %d\n", mode);
}'
The initial values of min
and max
are not exactly scientific. It serves to illustrate a point.
(This 1997 version is almost identical to its 1991 predecessor - all except for the version information line is identical, in fact. So, the code is over 20 years old.)
Upvotes: 3
Reputation: 48290
Here's one solution:
awk '
$3 != 0 { vals[$NR]=$1/$3; sum += vals[$NR]; print vals[$NR] }
$3 == 0 { vals[$NR]=0; print "skipping division by 0" }
END { sort vals; print "Mean = " sum/$NR ", Median ~ " vals[$NR/2] }
' < your_file
This will calculate, print, and accumulate the quotients if the 3rd column is not zero. When it reaches the end of your file (which should not have an empty line), it will print the mean and median of all the quotients, assuming 0 for each line in which it would have divided by zero.
In awk
, $n
means the n
th field, starting with 1, and $NR
means the number of records (that is, the number of lines) that have been processed. Each quotient is stored in the array vals
, enabling us to calculate the median value.
In real life, the median is defined as the "middle" item given an odd number of elements, or the mean of the two "middle" items given an even number of elements.
And you're on your own when it comes to implementing the sort
function!
Upvotes: 2