Reputation: 4027
I often find myself with a file that has one number per line. I end up importing it into Excel to view things like the median, standard deviation, and so forth.
Is there a command-line utility in Linux to do the same? I usually need the average, median, min, max and standard deviation.
Upvotes: 86
Views: 64445
Reputation: 11
Not enough solutions yet? ;-) Let me toss in gnuplot's stats command. Gnuplot is a remarkably fast data-analytics tool for plotting, regression and more.
seq 10 | gnuplot -e "stats '-' u 1"
* FILE:
Records: 10
Out of range: 0
Invalid: 0
Header records: 0
Blank: 0
Data Blocks: 1
* COLUMN:
Mean: 5.5000
Std Dev: 2.8723
Sample StdDev: 3.0277
Skewness: 0.0000
Kurtosis: 1.7758
Avg Dev: 2.5000
Sum: 55.0000
Sum Sq.: 385.0000
Mean Err.: 0.9083
Std Dev Err.: 0.6423
Skewness Err.: 0.7746
Kurtosis Err.: 1.5492
Minimum: 1.0000 [ 0]
Maximum: 10.0000 [ 9]
Quartile: 3.0000
Median: 5.5000
Quartile: 8.0000
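The same works directly on a file; for example (data.txt being a hypothetical one-number-per-line file):
gnuplot -e "stats 'data.txt' u 1"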
Upvotes: 1
Reputation: 175
The chosen answer uses R. Using the same tool, I find a script nicer to work with than a one-liner, as it can be modified more comfortably to add specific stats or to format the output differently.
Given this file data.txt:
1
2
3
4
5
6
7
8
9
10
Having this basic-stats script in $PATH:
#!/usr/bin/env Rscript
# Build a numeric vector.
x <- as.numeric(readLines("stdin"))
# Custom basic statistics.
basic_stats <- data.frame(
N = length(x), min = min(x), mean = mean(x), median = median(x), stddev = sd(x),
percentile_95 = quantile(x, c(.95)), percentile_99 = quantile(x, c(.99)),
max = max(x))
# Print output.
print(round(basic_stats, 3), row.names = FALSE, right = FALSE)
Execute basic-stats < data.txt to print the following to stdout:
N min mean median stddev percentile_95 percentile_99 max
10 1 5.5 5.5 3.028 9.55 9.91 10
The formatting can look a bit nicer by replacing the last 2 lines of the script with the following:
# Print output. Tabular formatting is done by the `column` command.
temp_file <- tempfile("basic_stats_", fileext = ".csv")
write.csv(round(basic_stats, 3), file = temp_file, row.names = FALSE, quote = FALSE)
system(paste("column -s, -t", temp_file))
. <- file.remove(temp_file)
This is the output now, with 2 spaces between columns (instead of 1 space):
N min mean median stddev percentile_95 percentile_99 max
10 1 5.5 5.5 3.028 9.55 9.91 10
Upvotes: 0
Reputation: 80415
#!/usr/bin/perl
#
# stdev - figure N, min, max, median, mode, mean, & std deviation
#
# pull out all the real numbers in the input
# stream and run standard calculations on them.
# they may be intermixed with other text, need
# not be on the same or different lines, and
# can be in scientific notation (avogadro=6.02e23).
# they also admit a leading + or -.
#
# Tom Christiansen
# [email protected]
use strict;
use warnings;
use List::Util qw< min max >;
#
my $number_rx = qr{
# leading sign, positive or negative
(?: [+-] ? )
# mantissa
(?= [0123456789.] )
(?:
# "N" or "N." or "N.N"
(?:
(?: [0123456789] + )
(?:
(?: [.] )
(?: [0123456789] * )
) ?
|
# ".N", no leading digits
(?:
(?: [.] )
(?: [0123456789] + )
)
)
)
# exponent
(?:
(?: [Ee] )
(?:
(?: [+-] ? )
(?: [0123456789] + )
)
|
)
}x;
my $n = 0;
my $sum = 0;
my @values = ();
my %seen = ();
while (<>) {
while (/($number_rx)/g) {
$n++;
my $num = 0 + $1; # 0+ is so numbers in alternate form count as same
$sum += $num;
push @values, $num;
$seen{$num}++;
}
}
die "no values" if $n == 0;
my $mean = $sum / $n;
my $sqsum = 0;
for (@values) {
$sqsum += ( $_ ** 2 );
}
$sqsum /= $n;
$sqsum -= ( $mean ** 2 );
my $stdev = sqrt($sqsum);
my $max_seen_count = max values %seen;
my @modes = grep { $seen{$_} == $max_seen_count } keys %seen;
my $mode = @modes == 1
? $modes[0]
: "(" . join(", ", @modes) . ")";
$mode .= ' @ ' . $max_seen_count;
my $median;
my @sorted = sort { $a <=> $b } @values;  # must sort numerically before taking the median
my $mid = int @sorted/2;
if (@sorted % 2) {
$median = $sorted[ $mid ];
} else {
$median = ($sorted[$mid-1] + $sorted[$mid])/2;
}
my $min = min @values;
my $max = max @values;
printf "n is %d, min is %g, max is %d\n", $n, $min, $max;
printf "mode is %s, median is %g, mean is %g, stdev is %g\n",
$mode, $median, $mean, $stdev;
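To try it (assuming the script is saved as stdev and made executable):
seq 10 | ./stdev
For the numbers 1 through 10 this reports n=10, min 1, max 10, median 5.5, mean 5.5 and a (population) standard deviation of about 2.87228; every value ties as a mode, since each occurs exactly once.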
Upvotes: 3
Reputation: 489
I found myself wanting to do this in a shell pipeline, and getting all the right arguments for R took a while. Here's what I came up with:
seq 10 | R --slave -e 'x <- scan(file="stdin",quiet=TRUE); summary(x)'
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.25 5.50 5.50 7.75 10.00
The --slave option "Make(s) R run as quietly as possible...It implies --quiet and --no-save." The -e option tells R to treat the following string as R code. The first statement reads from standard input and stores the result in the variable x. The quiet=TRUE option to the scan function suppresses the line saying how many items were read. The second statement applies the summary function to x, which produces the output.
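summary() does not report the standard deviation; a sketch that also prints it (same idea, one more statement):
seq 10 | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); print(summary(x)); cat("sd:", sd(x), "\n")'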
Upvotes: 10
Reputation: 7714
For the average, median and standard deviation you can use awk. This will generally be faster than R solutions. For instance, the following prints the average:
awk '{a+=$1} END{print a/NR}' myfile
(NR is an awk variable holding the number of records, and $1 means the first whitespace-separated field of the line; $0 would be the whole line, which would also work here since awk's numeric conversion just takes the leading number, but $1 is more precise. END means the following block is executed after the whole file has been processed. One could also initialize a to 0 in a BEGIN{a=0} block, but awk variables default to 0 anyway.)
Here is a simple awk script which provides more detailed statistics (it takes a CSV file as input; otherwise change FS):
#!/usr/bin/awk -f
BEGIN {
FS=",";
}
{
a += $1;
b[++i] = $1;
}
END {
m = a/NR; # mean
for (i in b)
{
d += (b[i]-m)^2;
e += (b[i]-m)^3;
f += (b[i]-m)^4;
}
va = d/NR; # variance
sd = sqrt(va); # standard deviation
sk = (e/NR)/sd^3; # skewness
ku = (f/NR)/sd^4-3; # excess kurtosis
print "N,sum,mean,variance,std,SEM,skewness,kurtosis"
print NR "," a "," m "," va "," sd "," sd/sqrt(NR) "," sk "," ku
}
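For instance (a sketch, assuming the script is saved as stats.awk and made executable; with a single column the comma separator is irrelevant):
seq 10 | ./stats.awk
which prints something like:
N,sum,mean,variance,std,SEM,skewness,kurtosis
10,55,5.5,8.25,2.87228,0.908295,0,-1.22424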
It is straightforward to add min/max to this script, but it is just as easy to pipe through sort with head/tail:
sort -n myfile | head -n1
sort -n myfile | tail -n1
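If you'd rather avoid the two extra sorting passes, a single-pass awk sketch for min and max:
awk 'NR==1{min=max=$1} $1<min{min=$1} $1>max{max=$1} END{print "min = " min; print "max = " max}' myfile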
Upvotes: 40
Reputation: 51
There is also stats, a self-written Perl utility (bundled with 'scut') that does just that. Fed a stream of numbers on STDIN, it tries to reject non-numbers and emits the following:
$ ls -lR | scut -f=4 | stats
Sum 3.10271e+07
Number 452
Mean 68643.9
Median 4469.5
Mode 4096
NModes 6
Min 2
Max 1.01171e+07
Range 1.01171e+07
Variance 3.03828e+11
Std_Dev 551206
SEM 25926.6
95% Conf 17827.9 to 119460
(for a normal distribution - see skew)
Skew 15.4631
(skew = 0 for a symmetric dist)
Std_Skew 134.212
Kurtosis 258.477
(K=3 for a normal dist)
It can also apply a number of transforms to the input stream, and it emits only the unadorned value if you ask for it; e.g., stats --mean will return the mean as an unlabelled float.
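For instance, a quick sanity check (assuming stats is on your PATH):
seq 10 | stats --mean
should print just 5.5 (possibly with more decimal places) and no label.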
Upvotes: 2
Reputation: 951
Yet another tool that can be used for calculating statistics and viewing the distribution in ASCII mode is ministat. It's a tool from FreeBSD, but it is also packaged for popular Linux distributions like Debian and Ubuntu. Or you can simply download and build it from source; it only requires a C compiler and the C standard library.
Usage example:
$ cat test.log
Handled 1000000 packets.Time elapsed: 7.575278
Handled 1000000 packets.Time elapsed: 7.569267
Handled 1000000 packets.Time elapsed: 7.540344
Handled 1000000 packets.Time elapsed: 7.547680
Handled 1000000 packets.Time elapsed: 7.692373
Handled 1000000 packets.Time elapsed: 7.390200
Handled 1000000 packets.Time elapsed: 7.391308
Handled 1000000 packets.Time elapsed: 7.388075
$ cat test.log | awk '{print $5}' | ministat -w 74
x <stdin>
+--------------------------------------------------------------------------+
| x |
|xx xx x x x|
| |__________________________A_______M_________________| |
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 8 7.388075 7.692373 7.54768 7.5118156 0.11126122
Upvotes: 29
Reputation: 406
Another tool: tsv-summarize, from eBay's tsv-utils. Min, max, mean, median and standard deviation are all supported. It is intended for large data sets. Example:
$ seq 10 | tsv-summarize --min 1 --max 1 --median 1 --stdev 1
1 10 5.5 3.0276503541
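The average from the question is covered too; a sketch adding --mean:
$ seq 10 | tsv-summarize --min 1 --max 1 --mean 1 --median 1 --stdev 1
1 10 5.5 5.5 3.0276503541
(fields are tab-separated, in the order the operators are given)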
Disclaimer: I'm the author.
Upvotes: 2
Reputation: 27349
This is a breeze with R. For a file that looks like this:
1
2
3
4
5
6
7
8
9
10
Use this:
R -q -e "x <- read.csv('nums.txt', header = F); summary(x); sd(x[ , 1])"
To get this:
V1
Min. : 1.00
1st Qu.: 3.25
Median : 5.50
Mean : 5.50
3rd Qu.: 7.75
Max. :10.00
[1] 3.02765
- The -q flag squelches R's startup licensing and help output.
- The -e flag tells R you'll be passing an expression from the terminal.
- x is a data.frame - a table, basically. It's a structure that accommodates multiple vectors/columns of data, which is a little peculiar if you're just reading in a single vector. This has an impact on which functions you can use.
- Some functions, like summary(), naturally accommodate data.frames. If x had multiple fields, summary() would provide the above descriptive stats for each.
- sd() can only take one vector at a time, which is why I index x for that command (x[ , 1] returns the first column of x). You could use apply(x, MARGIN = 2, FUN = sd) to get the SDs for all columns.
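If you want exactly the five statistics the question asks for in a single call, a sketch along the same lines (pulling the column out as a vector first):
R -q -e "x <- read.csv('nums.txt', header = F)[ , 1]; c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))"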
Upvotes: 65
Reputation: 46856
Mean:
awk '{sum += $1} END {print "mean = " sum/NR}' filename
Median:
gawk -v max=128 '
function median(c,v,  j) {
asort(v,j)
if (c % 2) return j[(c+1)/2]
else return (j[c/2+1]+j[c/2])/2.0
}
{
count++
values[count]=$1
if (count >= max) {
print median(count,values); count=0; delete values
}
}
END {
if (count) print "median = " median(count,values)
}
' filename
Mode:
awk '{c[$1]++} END {for (i in c) {if (c[i]>freq) {freq=c[i]; mode=i}} print "mode = " mode}' filename
This reports a single mode; if several values tie for the highest count, only one of them is printed, but you see how it works...
Standard Deviation:
awk '{sum+=$1; sumsq+=$1*$1} END {print "stdev = " sqrt(sumsq/NR - (sum/NR)**2)}' filename
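Note that this is the population standard deviation; for the sample standard deviation (what R's sd() and Excel's STDEV report), a sketch:
awk '{sum+=$1; sumsq+=$1*$1} END {print "sample stdev = " sqrt((sumsq - sum*sum/NR)/(NR-1))}' filename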
Upvotes: 18
Reputation: 4700
Using xsv:
$ echo '3 1 4 1 5 9 2 6 5 3 5 9' |tr ' ' '\n' > numbers-one-per-line.csv
$ xsv stats -n < numbers-one-per-line.csv
field,type,sum,min,max,min_length,max_length,mean,stddev
0,Integer,53,1,9,1,1,4.416666666666667,2.5644470922381863
# mode/median/cardinality are not shown by default, since they require storing the full file in memory:
$ xsv stats -n --everything < numbers-one-per-line.csv | xsv table
field type sum min max min_length max_length mean stddev median mode cardinality
0 Integer 53 1 9 1 1 4.416666666666667 2.5644470922381863 4.5 5 7
Upvotes: 3
Reputation: 1328
Yet another tool: https://www.gnu.org/software/datamash/
# Example: calculate the sum and mean of values 1 to 10:
$ seq 10 | datamash sum 1 mean 1
55 5.5
It may be more commonly packaged than the alternatives (it was the first of these tools I found prepackaged for Nix, at least).
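It covers the rest of the question's list as well; a fuller sketch (sstdev is the sample standard deviation; use pstdev for the population one):
$ seq 10 | datamash min 1 max 1 mean 1 median 1 sstdev 1
1 10 5.5 5.5 3.0276503541
(fields are tab-separated; the printed precision may differ)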
Upvotes: 20
Reputation: 1302
You might also consider clistats. It is a highly configurable command-line tool that computes statistics for a stream of delimited input numbers.
NOTE: I'm the author.
Upvotes: 8
Reputation: 27349
data_hacks is a Python command-line utility for basic statistics. The first example from that page produces the desired results:
$ cat /tmp/data | histogram.py
# NumSamples = 29; Max = 10.00; Min = 1.00
# Mean = 4.379310; Variance = 5.131986; SD = 2.265389
# each * represents a count of 1
1.0000 - 1.9000 [ 1]: *
1.9000 - 2.8000 [ 5]: *****
2.8000 - 3.7000 [ 8]: ********
3.7000 - 4.6000 [ 3]: ***
4.6000 - 5.5000 [ 4]: ****
5.5000 - 6.4000 [ 2]: **
6.4000 - 7.3000 [ 3]: ***
7.3000 - 8.2000 [ 1]: *
8.2000 - 9.1000 [ 1]: *
9.1000 - 10.0000 [ 1]: *
Upvotes: 8
Reputation: 41
There is also simple-r, which can do almost everything that R can, but with fewer keystrokes:
https://code.google.com/p/simple-r/
To calculate basic descriptive statistics, one would have to type one of:
r summary file.txt
r summary - < file.txt
cat file.txt | r summary -
For each of average, median, min, max and std deviation, the code would be:
seq 1 100 | r mean -
seq 1 100 | r median -
seq 1 100 | r min -
seq 1 100 | r max -
seq 1 100 | r sd -
Doesn't get any simple-R!
Upvotes: 3
Reputation: 611
Using "st" (https://github.com/nferraz/st)
$ st numbers.txt
N min max sum mean stddev
10 1 10 55 5.5 3.02765
Or:
$ st numbers.txt --transpose
N 10
min 1
max 10
sum 55
mean 5.5
stddev 3.02765
(DISCLAIMER: I wrote this tool :))
Upvotes: 49
Reputation: 91
Just in case, there's datastat, a simple program for Linux that computes basic statistics from the command line. For example,
cat file.dat | datastat
will output the average value across all rows for each column of file.dat. If you also need the standard deviation, min and max, add the --dev, --min and --max options, respectively.
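For example, a sketch combining the options just mentioned:
cat file.dat | datastat --dev --min --max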
datastat can also aggregate rows based on the value of one or more "key" columns. For example,
cat file.dat | datastat -k 1
will produce, for each distinct value found in the first column (the "key"), the average of the other column values, aggregated over all rows that share that key value. You can use more columns as key fields (e.g., -k 1-3, -k 2,4, etc.).
It's written in C++, runs fast with a small memory footprint, and can be piped nicely with other tools such as cut, grep, sed, sort, awk, etc.
Upvotes: 9
Reputation: 4870
Yep, it's called perl, and here is a concise one-liner:
perl -e 'use List::Util qw(max min sum); @a=();while(<>){chomp;$sqsum+=$_*$_; push(@a,$_)}; $n=@a;$s=sum(@a);$avg=$s/$n;$m=max(@a);$mm=min(@a);$std=sqrt($sqsum/$n-$avg*$avg);$mid=int @a/2;@srtd=sort {$a <=> $b} @a;if(@a%2){$med=$srtd[$mid];}else{$med=($srtd[$mid-1]+$srtd[$mid])/2;};print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'
Example
$ cat tt
1
3
4
5
6.5
7.
2
3
4
And the command
cat tt | perl -e 'use List::Util qw(max min sum); @a=();while(<>){chomp;$sqsum+=$_*$_; push(@a,$_)}; $n=@a;$s=sum(@a);$avg=$s/$n;$m=max(@a);$mm=min(@a);$std=sqrt($sqsum/$n-$avg*$avg);$mid=int @a/2;@srtd=sort {$a <=> $b} @a;if(@a%2){$med=$srtd[$mid];}else{$med=($srtd[$mid-1]+$srtd[$mid])/2;};print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'
records:9
sum:35.5
avg:3.94444444444444
std:1.86256162380447
med:4
max:7.
min:1
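For readability, here is the same logic as a short script (a sketch, equivalent to the one-liner above):
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max min sum);
my @a;
my $sqsum = 0;
while (<>) {
    chomp;
    $sqsum += $_ * $_;    # accumulate the sum of squares while reading
    push @a, $_;
}
die "no values\n" unless @a;
my $n   = @a;
my $sum = sum(@a);
my $avg = $sum / $n;
my $std = sqrt($sqsum / $n - $avg * $avg);    # population standard deviation
my @srtd = sort { $a <=> $b } @a;             # numeric sort, needed for the median
my $mid  = int($n / 2);
my $med  = $n % 2 ? $srtd[$mid] : ($srtd[$mid - 1] + $srtd[$mid]) / 2;
printf "records:%s\nsum:%s\navg:%s\nstd:%s\nmed:%s\nmax:%s\nmin:%s\n",
    $n, $sum, $avg, $std, $med, max(@a), min(@a);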
Upvotes: 21