Reputation: 1064
I'd like to calculate the variance of a table column. E.g. the formula to calculate the variance contains aggregate and scalar functions and is:
(SUM((var-AVG(var))*(var-AVG(var))))/(COUNT(var)-1)
where var is the variable-column to calculate the variance.
What is the best way to use this function in Sqlite, like:
SELECT (SUM((var-AVG(var))*(var-AVG(var))))/(COUNT(var)-1) AS Variance FROM
TableX
Upvotes: 2
Views: 7132
Reputation: 487
I needed to calculate 5min averages, and wanted to calculate the variance, on those same 5min intervals.
In short :
COUNT()-1
.Maybe not very elegant, but easier for me to catch. I checked the results with LibreOffice Calc (VAR.P function).
The data
The data are in a table called magic
.
CREATE TABLE magic(
"dt" TEXT,
"qccm" REAL,
PRIMARY KEY (dt)
);
Create a view of the 5min averages
CREATE VIEW v_magic5m(
"dt",
"qccm",
)
AS SELECT
datetime(round(cast(strftime('%s',dt) AS INT)/300)*300, 'unixepoch') AS rndtime,
avg(qccm),
GROUP BY rndtime;
Calculate the variance
WITH tmp
AS (
SELECT
datetime(round(cast(strftime('%s',dt) AS INT)/300)*300, 'unixepoch') AS rnd_dt,
qccm
FROM magic
)
SELECT
tmp.rnd_dt,
v_magic5m.qccm AS avg_qccm,
SUM( (tmp.qccm - v_magic5m.qccm)*(tmp.qccm - v_magic5m.qccm) ) / COUNT(tmp.qccm) AS var_qccm
FROM tmp
LEFT JOIN v_magic5m ON tmp.rnd_dt=v_magic5m.dt
GROUP BY tmp.rnd_dt;
And the output looks like this :
rnd_dt avg_qccm var_qccm
------------------- ---------------- ----------------
2021-04-28 09:55:00 334.292929292929 765.924293439445
2021-04-28 10:00:00 332.743333333333 571.924122222222
2021-04-28 10:05:00 333.04 501.165066666667
Upvotes: 0
Reputation: 2204
In partial response to CL's answer above, I am not sure whether the SELECT AVG(var) FROM TableX is cached or calculated twice (because it appears twice). For large tables/ views, this may become an issue. You may like an alternative:-
SELECT diff_squared / [n-1] FROM
(
SELECT diff * diff AS diff_squared FROM
(
SELECT x - mean AS diff FROM
(
SELECT [Var] AS x FROM TableX
) innerData
LEFT JOIN
(
SELECT avg(x) AS mean FROM
(
SELECT [Var] AS x FROM TableX
)
) stats
) outerData
) LEFT JOIN
(
SELECT count(x) - 1 AS [n-1] FROM
(
SELECT [Var] AS x FROM TableX
)
) nMinus1
Upvotes: 0
Reputation: 180162
The value of AVG(var)
is to be used in each expression to be summed, so you have to compute it separately with a subquery:
SELECT SUM((var-(SELECT AVG(var) FROM TableX))*
(var-(SELECT AVG(var) FROM TableX)) ) / (COUNT(var)-1) AS Variance
FROM TableX
Upvotes: 4