Dickey Singh

Reputation: 713

Aggregation and statistical functions on NoSQL databases

With SQL databases, it is easy to compute statistical and aggregate functions such as covariance, standard deviation, kurtosis, skewness, deviations, means and medians, sums and products, without moving the data out to an application server. http://www.xarg.org/2012/07/statistical-functions-in-mysql/

How are such computations done efficiently (as close to the store as possible, assuming map/reduce "jobs" won't be real-time) on NoSQL databases in general, and on DynamoDB (Cassandra) in particular, for large datasets?

AWS RDS (MySQL, PostgreSQL, ...) is, well, not NoSQL, and Amazon Redshift (ParAccel), a column store, has a SQL interface and may be overkill ($6.85/hr). Redshift has limited aggregation functionality (http://docs.aws.amazon.com/redshift/latest/dg/c_Aggregate_Functions.html, http://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html).

Upvotes: 1

Views: 1357

Answers (3)

Chen Harel

Reputation: 10052

MongoDB has some aggregation capabilities that might fit your needs: http://docs.mongodb.org/manual/aggregation/
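As a rough illustration of what that could look like, here is a minimal sketch of an aggregation pipeline that computes summary statistics on the server. It assumes a MongoDB release that supports the `$stdDevPop` and `$stdDevSamp` accumulators (3.2 onward); the helper name `buildStatsPipeline`, the output field names, and the `readings` collection in the usage note are hypothetical.

```javascript
// Hypothetical helper: builds a one-stage pipeline that computes summary
// statistics for a single numeric field across all documents.
function buildStatsPipeline(field) {
  const ref = "$" + field; // field path syntax used inside pipeline stages
  return [
    {
      $group: {
        _id: null,                        // one group covering every document
        count: { $sum: 1 },               // number of documents
        sum: { $sum: ref },               // total of the field
        mean: { $avg: ref },              // arithmetic mean
        stdDevPop: { $stdDevPop: ref },   // population standard deviation
        stdDevSamp: { $stdDevSamp: ref }, // sample standard deviation
      },
    },
  ];
}
```

In the mongo shell this might be run as `db.readings.aggregate(buildStatsPipeline("value"))`, so only the one-document result leaves the database. Statistics without a built-in accumulator (kurtosis, skewness, median) would still need extra pipeline stages or client-side work.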

Upvotes: 1

AndySavage

Reputation: 1769

For DBs that have no aggregate functionality (e.g. Cassandra), you are always going to have to pull some data out. Building distributed computation clusters close to your DB is a popular option at the moment (using projects such as Storm). This way you can request and process data in parallel to do your operations. Think of it as a "real time" Hadoop (though it isn't the same).

Implementing such a setup is obviously more complicated than having a system that supports it out of the box, so factor that into your decision. The upside is that, if needed, a cluster allows you to perform complex custom analysis way beyond anything that will be supported in a traditional DB solution.
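To make the "process data in parallel" idea concrete, here is a minimal sketch of how variance can be computed from partial results: each worker summarizes its own partition in one pass, and the small summaries are then merged. This uses the standard pairwise-merge formulas for count, mean, and sum of squared deviations; it is not tied to Storm or Cassandra, and the function names are hypothetical.

```javascript
// Summarize one partition in a single pass (Welford's update).
// m2 is the running sum of squared deviations from the mean.
function summarize(values) {
  let count = 0, mean = 0, m2 = 0;
  for (const x of values) {
    count += 1;
    const delta = x - mean;
    mean += delta / count;
    m2 += delta * (x - mean);
  }
  return { count, mean, m2 };
}

// Combine two partition summaries without revisiting the raw data.
function merge(a, b) {
  if (a.count === 0) return b;
  if (b.count === 0) return a;
  const count = a.count + b.count;
  const delta = b.mean - a.mean;
  const mean = a.mean + (delta * b.count) / count;
  const m2 = a.m2 + b.m2 + (delta * delta * a.count * b.count) / count;
  return { count, mean, m2 };
}

// Population variance (divides by N) from a merged summary.
function populationVariance(s) {
  return s.count > 0 ? s.m2 / s.count : NaN;
}

// Example: three partitions, as if read from three nodes in parallel.
const partitions = [[2, 4, 4], [4, 5], [5, 7, 9]];
const total = partitions.map(summarize).reduce(merge, summarize([]));
console.log(total.mean, Math.sqrt(populationVariance(total)));
```

Because only three numbers per partition cross the network, the same pattern fits a map/reduce job or a streaming topology equally well. Quantities such as the median do not merge this way and need a different approach.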

Upvotes: 2

Vladislav Rastrusny

Reputation: 29985

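Well, in MongoDB you can store JavaScript functions on the server, which gives you a kind of UDF:

// Population variance of an array of numbers (divides by N, not N - 1).
db.system.js.save( { _id : "Variance" ,
value : function(key,values)
{
    // Mean of the values, computed inline so no other stored function is required.
    var sum = 0;
    for(var i = 0; i < values.length; i++)
    {
        sum += values[i];
    }
    var mean = sum / values.length;

    // Sum of squared deviations from the mean.
    var squared_Diff = 0;
    for(var j = 0; j < values.length; j++)
    {
        var deviation = values[j] - mean;
        squared_Diff += deviation * deviation;
    }
    return squared_Diff / values.length;
}});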


// Population standard deviation: the square root of the stored Variance function's result.
db.system.js.save( { _id : "Standard_Deviation"
, value : function(key,values)
{
    var variance = Variance(key,values);
    return Math.sqrt(variance);
}});

The description is here.

Upvotes: 1
