sprocket
sprocket

Reputation: 1277

How to emulate linear regression (REGR_) functions in BigQuery?

Some SQL databases provide linear regression over pairs of values with a family of functions called REGR_SLOPE, REGR_R2, REGR_SXX, etc. How can a BigQuery user emulate those functions?

Upvotes: 3

Views: 1176

Answers (1)

sprocket
sprocket

Reputation: 1277

Editor's edit: Please see linear regression is now natively supported in BigQuery. --Fh


The REGR_ functions can be translated into expressions involving aggregates that BigQuery implements, but a little care should be taken to handle NULLs. The REGR_ functions take pairs consisting of an independent variable, let's call it "x", and a dependent variable "y" and ignore the pairs where either variable is NULL.

The following query computes all of the REGR_ expressions at once. It's written in BigQuery standard SQL and takes advantage of user-defined SQL functions to make the query more concise. By testing x and y for NULL-ness within the expression, this version preserves keys. In other words, if all of the pairs for some key contain a NULL, the result will be NULL.

CREATE TEMPORARY FUNCTION NonNullX(y FLOAT64, x FLOAT64)
RETURNS FLOAT64
AS (IF(x IS NOT NULL AND y IS NOT NULL, x, NULL));

CREATE TEMPORARY FUNCTION NonNullY(y FLOAT64, x FLOAT64)
RETURNS FLOAT64
AS (IF(x IS NOT NULL AND y IS NOT NULL, y, NULL));

CREATE TEMPORARY FUNCTION NonNullCount(y FLOAT64, x FLOAT64)
RETURNS INT64
AS (IF(x IS NOT NULL AND y IS NOT NULL, 1, 0));

CREATE TEMPORARY FUNCTION REGR_R2_COEF(y_var FLOAT64, x_var FLOAT64, xy_corr FLOAT64)
RETURNS FLOAT64
AS (IF(x_var = 0, NULL, IF(y_var = 0 AND x_var <> 0, 1, POWER(xy_corr, 2))));

SELECT k, 
       AVG(NonNullX(y, x)) AS REGR_AVGX,
       AVG(NonNullY(y, x)) AS REGR_AVGY,
       SUM(NonNullCount(y, x)) AS REGR_COUNT,
       AVG(NonNullY(y, x))-(COVAR_POP(NonNullY(y, x), NonNullX(y, x))/VAR_POP(NonNullX(y, x)))*AVG(NonNullX(y, x)) AS REGR_INTERCEPT,
       REGR_R2_COEF(VAR_POP(NonNullY(y, x)), VAR_POP(NonNullX(y, x)), CORR(NonNullY(y, x), NonNullX(y, x))) AS REGR_R2,
       COVAR_POP(NonNullY(y, x), NonNullX(y, x))/VAR_POP(NonNullX(y, x)) AS REGR_SLOPE,
       SUM(NonNullCount(y, x)) * VAR_POP(NonNullX(y, x)) AS REGR_SXX,
       SUM(NonNullCount(y, x)) * COVAR_POP(NonNullY(y, x), NonNullX(y, x)) AS REGR_SXY,
       SUM(NonNullCount(y, x)) * VAR_POP(NonNullY(y, x)) AS REGR_SYY
FROM Temp.SimpleRegressionData
GROUP BY k;

If you don't care about having NULL results you could filter out pairs containing NULLs in a WHERE clause and the aggregate expressions would simplify as follows:

CREATE TEMPORARY FUNCTION REGR_R2_COEF(y_var FLOAT64, x_var FLOAT64, xy_corr FLOAT64)
RETURNS FLOAT64
AS (IF(x_var = 0, NULL, IF(y_var = 0 AND x_var <> 0, 1, POWER(xy_corr, 2))));

SELECT k, 
       AVG(x) AS REGR_AVGX,
       AVG(y) AS REGR_AVGY,
       SUM(1) AS REGR_COUNT,
       AVG(y)-(COVAR_POP(y, x)/VAR_POP(x))*AVG(x) AS REGR_INTERCEPT,
       REGR_R2_COEF(VAR_POP(y), VAR_POP(x), CORR(y, x)) AS REGR_R2,
       COVAR_POP(y, x)/VAR_POP(x) AS REGR_SLOPE,
       SUM(1) * VAR_POP(x) AS REGR_SXX,
       SUM(1) * COVAR_POP(y, x) AS REGR_SXY,
       SUM(1) * VAR_POP(y) AS REGR_SYY
FROM Temp.SimpleRegressionData
WHERE x IS NOT NULL AND Y IS NOT NULL
GROUP BY k;

Note that either of these queries may give a divide by zero error if all of the non-NULL x values for some key are equal.

Upvotes: 4

Related Questions