Johannes
Johannes

Reputation: 1064

Calculate variance in sqlite

I'd like to calculate the variance of a table column. E.g. the formula to calculate the variance contains aggregate and scalar functions and is:

(SUM((var-AVG(var))*(var-AVG(var))))/(COUNT(var)-1)

where var is the variable-column to calculate the variance.

What is the best way to use this function in Sqlite, like:

SELECT (SUM((var-AVG(var))*(var-AVG(var))))/(COUNT(var)-1) AS Variance FROM
TableX

Upvotes: 2

Views: 7132

Answers (3)

dpeng
dpeng

Reputation: 487

I needed to calculate 5min averages, and wanted to calculate the variance, on those same 5min intervals.

In short :

  1. create a view of the 5min averages
  2. use this view to compute the variance much like CL above in this thread, although I don't agree with the -1 in COUNT()-1.

Maybe not very elegant, but easier for me to catch. I checked the results with LibreOffice Calc (VAR.P function).

The data

The data are in a table called magic.

CREATE TABLE magic(
  "dt" TEXT,
  "qccm" REAL,
  PRIMARY KEY (dt)
);

Create a view of the 5min averages

CREATE VIEW v_magic5m(
"dt",
"qccm",
)
AS SELECT
datetime(round(cast(strftime('%s',dt) AS INT)/300)*300, 'unixepoch') AS rndtime,
avg(qccm),
GROUP BY rndtime;

Calculate the variance

WITH tmp
AS (
    SELECT
    datetime(round(cast(strftime('%s',dt) AS INT)/300)*300, 'unixepoch') AS rnd_dt, 
    qccm
    FROM magic
)
SELECT
    tmp.rnd_dt,
    v_magic5m.qccm AS avg_qccm,
    SUM( (tmp.qccm - v_magic5m.qccm)*(tmp.qccm - v_magic5m.qccm) ) / COUNT(tmp.qccm) AS var_qccm
FROM tmp
LEFT JOIN v_magic5m ON tmp.rnd_dt=v_magic5m.dt
GROUP BY tmp.rnd_dt;

And the output looks like this :

rnd_dt               avg_qccm          var_qccm        
-------------------  ----------------  ----------------
2021-04-28 09:55:00  334.292929292929  765.924293439445
2021-04-28 10:00:00  332.743333333333  571.924122222222
2021-04-28 10:05:00  333.04            501.165066666667

Upvotes: 0

Phil Whittington
Phil Whittington

Reputation: 2204

In partial response to CL's answer above, I am not sure whether the SELECT AVG(var) FROM TableX is cached or calculated twice (because it appears twice). For large tables/ views, this may become an issue. You may like an alternative:-

SELECT diff_squared / [n-1] FROM
(
    SELECT diff * diff AS diff_squared FROM 
    (
        SELECT x - mean AS diff FROM 
        (
            SELECT [Var] AS x FROM TableX
        ) innerData
        LEFT JOIN
        (
            SELECT avg(x) AS mean FROM
            (
                SELECT [Var] AS x FROM TableX
            ) 
        ) stats
    ) outerData
) LEFT JOIN
(
    SELECT count(x) - 1 AS [n-1] FROM 
    (
        SELECT [Var] AS x FROM TableX
    )   
) nMinus1

Upvotes: 0

CL.
CL.

Reputation: 180162

The value of AVG(var) is to be used in each expression to be summed, so you have to compute it separately with a subquery:

SELECT SUM((var-(SELECT AVG(var) FROM TableX))*
           (var-(SELECT AVG(var) FROM TableX)) ) / (COUNT(var)-1) AS Variance
FROM TableX

Upvotes: 4

Related Questions