Kuan Yao
Kuan Yao

Reputation: 165

BigQuery - How to create a new column where the calculation includes the new column itself?

---Summary---

I have three columns: [visitorID], [rank], [numbers].

In BigQuery, I would like to create a new column [calculation], which is part of the summation of [numbers] and [calculation] itself, incluing specified conditions.

The problem I an encountering now is that "In BigQuery, I cannot create a column that needs the calculation including the column I am creating". I am not sure whether my concept or idea is suitable or not, and I hope there are some better suggestions.

---Details---

*The Table I have:

A table with three columns: [visitorID], [rank], [numbers].

*The New Column I need to create:

Need to create the column [calculation].

*The definition of calculation:

After ordering by [visitorID] and [rank], the [calculation] is

(i) If [numbers] = 0, THEN [calculation] = 0 (ii) If [numbers] <> 0, THEN sum up the current [numbers] value and the previous [calculation] number. (iii) Based on (ii), if the summation is larger than 30, then [calculation] = 0, ELSE [calculation] remains the same summation value.

See example as below. enter image description here

*The Problem I am encountering

I need to use BigQuery to do this kind of calculation. However, what I came up with is the "window sum function", which seemed not a good solution to this. I think the key point is that "In BigQuery, I cannot create a column that needs the calculation including the column I am creating".

See example as below. Tried window function, which was failed.

That is, I always need the existed value to create a new column. I have my sample query as follow, which can not solve the problem. And you can also see the print-screen to understand what the problem is.

See sample query as below.

SELECT 
  visitorID, 
  rank, 
  numbers, 
  SUM(numbers) OVER (PARTITION BY visitorID ORDER BY rank) AS window_sum_current, 
  SUM(numbers) OVER (PARTITION BY visitorID ORDER BY rank ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS window_sum_prec1     
FROM sample_table

*Seeking for suggestions

I would like to ask for suggestions. (1) In BigQuery, is this problem solvable or not ? (2) What method or concept I am lacking of ? (3) What is a better way to solve the issue in BigQuery ?

Thank you very much.

Upvotes: 2

Views: 4288

Answers (2)

Luke
Luke

Reputation: 848

What you are trying to do is not possible inside a SQL Select statement. Your asking to do something that is going to inherently require variables, loops, and if statements, none of which are available to you inside a SQL Select statement.

Granted, you can do this sort of thing inside SQL code using table valued functions and stored procedures.. Using an outside programming language that was built for exactly what you want to do is likely going to be the easiest route for you.

Run your query for [visitorID], [rank], [numbers] inside SQL, and then just develop the [calculation] data as you output to whatever file or screen you need to.

Hope it clarifies your situation.

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

For BigQuery the only solution I can propose so far is as below
It produces exactly result yo expect for the price for you to learn a little about BigQuery User-Defined Functions. Very powerful feature if to use properly. In some cases it has memory issue that is soon to be addressed per this link.
I strongly recommend to adopt this feature!

SELECT visitorID, rank, numbers, calculation FROM JS(
// input table
( SELECT visitorID, 
    GROUP_CONCAT(CONCAT(STRING(100000 + rank), ',', STRING(numbers)), ';') AS list
  FROM
    (SELECT 1001 AS visitorID, 1 AS rank, 0 AS numbers), 
    (SELECT 1001 AS visitorID, 2 AS rank, 13 AS numbers), 
    (SELECT 1001 AS visitorID, 3 AS rank, 13 AS numbers), 
    (SELECT 1001 AS visitorID, 4 AS rank, 6 AS numbers), 
    (SELECT 1001 AS visitorID, 5 AS rank, 4 AS numbers), 
    (SELECT 1001 AS visitorID, 6 AS rank, 5 AS numbers), 
    (SELECT 1001 AS visitorID, 7 AS rank, 26 AS numbers), 
    (SELECT 1001 AS visitorID, 8 AS rank, 32 AS numbers), 
    (SELECT 999 AS visitorID, 1 AS rank, 0 AS numbers), 
    (SELECT 999 AS visitorID, 2 AS rank, 2 AS numbers), 
    (SELECT 999 AS visitorID, 3 AS rank, 2 AS numbers), 
    (SELECT 999 AS visitorID, 4 AS rank, 2 AS numbers), 
    (SELECT 999 AS visitorID, 5 AS rank, 2 AS numbers), 
    (SELECT 999 AS visitorID, 6 AS rank, 24 AS numbers) 
  GROUP BY visitorID
),
// input columns
visitorID, list,
// output schema
"[{name: 'visitorID', type: 'integer'},
{name: 'rank', type: 'integer'},
{name: 'numbers', type: 'integer'},
{name: 'calculation', type: 'integer'}]",
// function
"function(r, emit){
  var list = r.list.split(';');
  list.sort();
  calculation = 0;
  for (var i = 0; i < list.length; i++) {
    rank = parseInt(list[i].split(',')[0]) - 100000;
    numbers = parseInt(list[i].split(',')[1]);
    if (numbers !== 0) calculation += numbers;
    if (calculation > 30) calculation = 0;
    emit({visitorID: r.visitorID, rank: rank, 
          numbers: numbers, calculation: calculation});
  }
 }"
)

Upvotes: 2

Related Questions