Reputation: 5652
I'm looking for a statistics package for Perl (CPAN is fine) that allows me to add data incrementally instead of having to pass in an entire array of data.
Just the mean, median, stddev, max, and min are needed; nothing too complicated.
The reason is that my dataset is far too large to fit into memory. The data source is a MySQL database, so right now I'm querying a subset of the data, computing the statistics for it, and then combining all the manageable subsets later.
If you have other ideas on how to overcome this issue, I'd be much obliged!
Upvotes: 9
Views: 2838
Reputation: 27
Except for the median, you don't need to store the entire data set. You can compute the other stats by updating a few running values with each new piece of data: for every new input, just keep track of the current n, sum, sum_of_squares, min, and max (a minimal sketch follows the derivation below).
Notes on the math for the variance, where mu is the mean and sum = SUM(xn). Most math books show the population variance as:

variance = sigma**2 = (1/N) * SUM((xn - mu)**2)
                    = (1/N) * SUM(xn**2 - 2*mu*xn + mu**2)
                    = (1/N) * (SUM(xn**2) - 2*mu*SUM(xn) + N*mu**2)
                    = SUM(xn**2)/N - 2*mu*SUM(xn)/N + N*mu**2/N
                    = SUM(xn**2)/N - 2*mu**2 + mu**2        (since SUM(xn)/N = mu)
                    = SUM(xn**2)/N - mu**2
                    = SUM(xn**2)/N - (sum/N)**2
                    = SUM(xn**2)/N - sum**2/N**2
                    = (1/N) * (SUM(xn**2) - sum**2/N)
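A minimal sketch of such a running accumulator in Perl, reading one value per line from STDIN. (Note: this textbook sum-of-squares formula can lose precision when the mean is large relative to the spread; see the Welford link elsewhere in this thread.)

use strict;
use warnings;

my ($n, $sum, $sum_sq) = (0, 0, 0);
my ($min, $max);

while (<STDIN>) {
    chomp;
    my $x = $_;
    $n++;
    $sum    += $x;
    $sum_sq += $x * $x;
    $min = $x if !defined $min || $x < $min;
    $max = $x if !defined $max || $x > $max;
}

my $mean     = $sum / $n;
my $variance = $sum_sq / $n - $mean**2;   # population variance, per the derivation above
printf "n=%d mean=%g stddev=%g min=%g max=%g\n",
    $n, $mean, sqrt($variance), $min, $max;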
Upvotes: 0
Reputation: 22570
PDL might provide a solution:
Have a look at this previous SO answer, which shows how to get the mean, std dev, etc. Here is the relevant part of the code:
use strict;
use warnings;
use PDL;
my $figs = pdl [
    [0.01, 0.01, 0.02, 0.04, 0.03],
    [0.00, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.00, 0.01, 0.05, 0.03],
];

my ($mean, $prms, $median, $min, $max, $adev, $rms) = statsover($figs);
Upvotes: 4
Reputation: 5318
I understand it's 4 years down the road, but in case anyone is interested, there is now a module for memory-efficient approximate statistical analysis of samples. Its interface generally follows that of Statistics::Descriptive and co.
It divides the sample into logarithmic intervals and keeps only hit counts. This introduces a fixed relative error (the precision can be adjusted in new()), but large amounts of data can be processed without using much memory.
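From the description this sounds like Statistics::Descriptive::LogScale (an inference, so check the module's documentation; the method names below follow the Statistics::Descriptive convention the answer mentions, and the new() parameter controlling precision is assumed to be base). A minimal sketch:

use strict;
use warnings;
use Statistics::Descriptive::LogScale;

# 'base' is assumed to set the bucket growth factor, i.e. the relative precision
my $stat = Statistics::Descriptive::LogScale->new(base => 1.01);

while (<STDIN>) {
    chomp;
    $stat->add_data($_);    # values are added incrementally, one at a time
}

printf "mean=%g stddev=%g median=%g\n",
    $stat->mean, $stat->standard_deviation, $stat->median;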
Upvotes: 0
Reputation: 3016
@DVK: The one-pass algorithms for calculating mean and standard deviation described at http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm are not approximations, and they are more numerically robust than the example you give. See the references on that page.
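For illustration, a minimal Perl sketch of that on-line (Welford's) algorithm, following the pseudocode on the linked page, reading one value per line from STDIN:

use strict;
use warnings;

my ($n, $mean, $M2) = (0, 0, 0);

while (<STDIN>) {
    chomp;
    my $x = $_;
    $n++;
    my $delta = $x - $mean;
    $mean += $delta / $n;
    $M2   += $delta * ($x - $mean);   # note: uses the already-updated mean
}

my $variance = $M2 / ($n - 1);        # sample variance (needs n >= 2); use $M2/$n for population
printf "mean=%g stddev=%g\n", $mean, sqrt($variance);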
Upvotes: 1
Reputation: 5082
This is largely untested, so use with care. Since I have bad memory, I checked the algorithm against Wikipedia. I'm not aware of an algorithm to calculate the median from a stream of numbers, but that doesn't mean there isn't one.
#!/usr/bin/perl
use strict;
use warnings;
use MooseX::Declare;

class SimpleStats {
    has 'min'       => (is => 'rw', isa => 'Num', default => 9**9**9);
    has 'max'       => (is => 'rw', isa => 'Num', default => -9**9**9);
    has 'A'         => (is => 'rw', isa => 'Num', default => 0);
    has 'Q'         => (is => 'rw', isa => 'Num', default => 0);
    has 'n'         => (is => 'rw', isa => 'Int', default => 0);
    has 'n_nonzero' => (is => 'rw', isa => 'Int', default => 0);
    has 'sum_w'     => (is => 'rw', isa => 'Num', default => 0); # Num, not Int: weights need not be integers

    # add one observation $x with optional weight $w
    method add (Num $x, Num $w = 1) {
        $self->min($x) if $x < $self->min;
        $self->max($x) if $x > $self->max;
        my $n = $self->n;
        if ($n == 0) {
            $self->A($x);
            $self->sum_w($w);
        }
        else {
            my $A = $self->A;
            my $Q = $self->Q;
            my $sum_w_before = $self->sum_w;
            $self->sum_w($sum_w_before + $w);
            $self->A($A + ($x - $A) * $w / $self->sum_w);
            $self->Q($Q + $w * ($x - $A) * ($x - $self->A));
        }
        $self->n($n + 1);
        $self->n_nonzero($self->n_nonzero + 1) if $w != 0;
        return();
    }

    method mean () { $self->A }

    method sample_variance () {
        $self->Q * $self->n_nonzero /
        ( ($self->n_nonzero - 1) * $self->sum_w )
    }

    method std_variance () { $self->Q / $self->sum_w }
    method std_dev () { sqrt($self->std_variance) }

    # slightly evil. Just don't reuse objects
    method reset () { %$self = %{__PACKAGE__->new()} }
}

package main;

my $stats = SimpleStats->new;
while (<STDIN>) {
    s/^\s+//;
    s/\s+$//;
    my ($x, $w) = split /\s+/, $_;   # each line: value, optionally followed by a weight
    if (defined $w) {
        $stats->add($x, $w);
    }
    else {
        $stats->add($x);
    }
}

print "Mean:       ", $stats->mean, "\n";
print "Sample var: ", $stats->sample_variance, "\n";
print "Std var:    ", $stats->std_variance, "\n";
print "Std dev:    ", $stats->std_dev, "\n";
print "Entries:    ", $stats->n, "\n";
print "Min:        ", $stats->min, "\n";
print "Max:        ", $stats->max, "\n";
Upvotes: 0
Reputation: 129529
You cannot do an exact stddev or median unless you either keep the whole data set in memory or run through the data twice.
UPDATE: While you cannot do an exact stddev in one pass the naive way, there is a one-pass algorithm for it; the link is in a comment on this answer (and, as that comment notes, it is exact, not an approximation).
The rest are completely trivial to do in 3-5 lines of Perl, no module needed. Stddev and median can be done in two passes fairly trivially as well. (I recently wrote a script that did exactly what you describe, but for IP reasons I'm pretty sure I'm not allowed to post it as an example, sorry.)
Sample code:
my ($min, $max);
my $sum   = 0;
my $count = 0;

# <> cannot be rewound, so open the input file explicitly and seek back for pass two
my $file = shift @ARGV;
open my $fh, '<', $file or die "Cannot open $file: $!";

while (<$fh>) {
    chomp;
    my $current_value = $_;  # assume input is 1 value/line for simplicity's sake
    $sum += $current_value;
    $count++;
    $min = $current_value if (!defined $min || $min > $current_value);
    $max = $current_value if (!defined $max || $max < $current_value);
}
my $mean = $sum / $count;

my $sum_mean_diffs = 0;
seek $fh, 0, 0;              # rewind for the second pass
while (<$fh>) {              # second pass to compute stddev (use it for the median too)
    chomp;
    my $current_value = $_;
    $sum_mean_diffs += ($current_value - $mean) * ($current_value - $mean);
}
close $fh;

my $std_dev = sqrt($sum_mean_diffs / $count);
# Median is left as an exercise for the reader.
Upvotes: 5
Reputation: 47889
Why don't you simply ask the database for the values you are trying to compute?
Among others, MySQL features aggregate (GROUP BY) functions; for anything missing, all you need is a little SQL.
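For instance, a sketch using DBI (the connection parameters, table measurements, and column val are hypothetical; COUNT, AVG, STDDEV_POP, MIN, and MAX are stock MySQL aggregates, while an exact median takes a little more SQL):

use strict;
use warnings;
use DBI;

# hypothetical DSN, credentials, table, and column
my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                       { RaiseError => 1 });

my ($n, $mean, $stddev, $min, $max) = $dbh->selectrow_array(
    'SELECT COUNT(val), AVG(val), STDDEV_POP(val), MIN(val), MAX(val)
       FROM measurements'
);
print "n=$n mean=$mean stddev=$stddev min=$min max=$max\n";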
Upvotes: 5
Reputation: 110509
Statistics::Descriptive::Discrete allows you to do this in a manner similar to Statistics::Descriptive, but it has been optimized for use with large data sets (the documentation reports, for example, a two-orders-of-magnitude (100x) improvement in memory usage).
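A minimal usage sketch, assuming the Statistics::Descriptive-style method names (add_data, mean, median, standard_deviation, min, max) carry over as documented:

use strict;
use warnings;
use Statistics::Descriptive::Discrete;

my $stats = Statistics::Descriptive::Discrete->new;
while (<STDIN>) {
    chomp;
    $stats->add_data($_);    # add values incrementally instead of all at once
}

print "mean=",   $stats->mean,               "\n";
print "median=", $stats->median,             "\n";
print "stddev=", $stats->standard_deviation, "\n";
print "min=",    $stats->min, " max=", $stats->max, "\n";

The memory saving comes from keeping a count per distinct value rather than the values themselves, so it pays off most when the data set contains relatively few unique values.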
Upvotes: 3