Juan David Ossa Gomez

Reputation: 83

How to create a calculated column in Google BigQuery?

I have data in Google BigQuery like this:

    id          yearmonth value
    00007BR0011 201705    8.0
    00007BR0011 201701    3.0

I need to create a table that shows, for each id, the difference between its values within the year, something like this:

    id          value
    00007BR0011 5

The value 5 is the value for 201705 minus the value for 201701.

I am using standard SQL, but I don't know how to create the column with the calculation.

Sorry in advance if this is too basic, but I haven't found anything useful yet.

Upvotes: 2

Views: 10151

Answers (2)

justbeez

Reputation: 1387

It's difficult to answer this based on the current level of detail, but if the smaller value is always subtracted from the larger (and both are never null), you could handle it this way using GROUP BY:

SELECT
  id,
  MAX(value) - MIN(value) AS new_value
FROM
  `your-project.your_dataset.your_table`
GROUP BY
  id

From here, you could save these results as a new table, or save this query as a view definition (which would be similar to having it calculated on the fly if the underlying data is changing).
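As a rough sketch of those two options, assuming BigQuery standard SQL DDL: the names `id_differences` and `id_differences_view` are placeholders made up for illustration, and you would normally pick one of the two statements rather than run both.

```sql
-- Option 1: materialize the result as a new table (a snapshot at run time).
-- `id_differences` is a hypothetical table name.
CREATE OR REPLACE TABLE `your-project.your_dataset.id_differences` AS
SELECT
  id,
  MAX(value) - MIN(value) AS new_value
FROM
  `your-project.your_dataset.your_table`
GROUP BY
  id;

-- Option 2: save the query as a view, which is re-evaluated on every read.
-- `id_differences_view` is a hypothetical view name.
CREATE OR REPLACE VIEW `your-project.your_dataset.id_differences_view` AS
SELECT
  id,
  MAX(value) - MIN(value) AS new_value
FROM
  `your-project.your_dataset.your_table`
GROUP BY
  id;
```

The table does not change when the source rows change, whereas the view always reflects the current contents of the source table.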

Another option is to add a column under the table schema, then run an UPDATE query to populate it.
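A hedged sketch of that approach, assuming the new column is called `new_value` and is a FLOAT64 (both are assumptions based on the sample values, not something stated in the question):

```sql
-- Add the column to the existing table; existing rows get NULL in it.
ALTER TABLE `your-project.your_dataset.your_table`
ADD COLUMN IF NOT EXISTS new_value FLOAT64;

-- Populate it from a per-id aggregate. The subquery returns one row per id,
-- so each target row matches at most one source row.
UPDATE `your-project.your_dataset.your_table` AS t
SET new_value = d.new_value
FROM (
  SELECT
    id,
    MAX(value) - MIN(value) AS new_value
  FROM
    `your-project.your_dataset.your_table`
  GROUP BY
    id
) AS d
WHERE t.id = d.id;
```

Note that this writes the same per-id difference onto every row of that id, and the column goes stale as soon as new rows arrive, so the UPDATE would need to be re-run.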

If the smaller value isn't always subtracted from the larger, and what matters instead is which date is earlier (and there are always two rows), another way to do this would be to use analytic (or window) functions to subtract the value at the earliest date from the value at the latest date:

SELECT
  DISTINCT
    id,
    (
      FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
      -
      LAST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) AS new_value
FROM
  `your-project.your_dataset.your_table`

Because analytic functions operate on the source rows, DISTINCT is needed to eliminate the duplicate rows.
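To see the duplication, here is a small self-contained sketch that inlines the two rows from the question in a CTE (the CTE name `sample` and the window name `w` are just illustrative) and leaves DISTINCT out on purpose:

```sql
WITH sample AS (
  -- The two rows from the question, inlined so no real table is needed.
  SELECT '00007BR0011' AS id, 201705 AS yearmonth, 8.0 AS value UNION ALL
  SELECT '00007BR0011', 201701, 3.0
)
SELECT
  id,
  yearmonth,
  -- Latest value minus earliest value within each id.
  FIRST_VALUE(value) OVER w - LAST_VALUE(value) OVER w AS new_value
FROM
  sample
WINDOW w AS (
  PARTITION BY id
  ORDER BY yearmonth DESC
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
```

This returns one output row per input row, and both rows carry the same new_value (8.0 - 3.0). Dropping yearmonth from the select list and adding DISTINCT collapses them into a single row per id.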

If there could be more than two rows and you need all the prior values subtracted from the latest value, you could handle it this way (which would also be safe against NULLs or only having one row):

SELECT
  DISTINCT
    id,
    (
      FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
      -
      (
        SUM(value) OVER(PARTITION BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
        -
        FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
      )
    ) AS new_value
FROM
  `your-project.your_dataset.your_table`

You could technically do the same thing with grouping and ARRAY_AGG with dereferencing, although this method will be significantly slower on larger data sets:

SELECT
  id,
  (
    ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
    -
    (
      SUM(value)
      -
      ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
    )
  ) AS new_value
FROM
  `your-project.your_dataset.your_table`
GROUP BY
  id
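As a quick sanity check of the "latest minus all prior values" logic, here is a sketch against inline data. The 201703 row is invented for illustration (the question only has two rows), and the expression is rewritten algebraically: latest - (sum - latest) is the same as 2 * latest - sum.

```sql
WITH sample AS (
  SELECT '00007BR0011' AS id, 201705 AS yearmonth, 8.0 AS value UNION ALL
  SELECT '00007BR0011', 201703, 1.0 UNION ALL  -- hypothetical extra row
  SELECT '00007BR0011', 201701, 3.0
)
SELECT
  id,
  -- LIMIT 1 keeps only the latest value in the aggregated array.
  2 * ARRAY_AGG(value ORDER BY yearmonth DESC LIMIT 1)[OFFSET(0)]
    - SUM(value) AS new_value
FROM
  sample
GROUP BY
  id
```

For this sample the latest value is 8.0 and the sum is 12.0, so new_value works out to 8.0 - (3.0 + 1.0) = 4.0.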

Upvotes: 0

Gordon Linoff

Reputation: 1269443

Perhaps a single table/result set would work for your purposes:

select id,
       (max(case when yearmonth = 201705 then value end) -
        max(case when yearmonth = 201701 then value end)
       ) as value
from t
where yearmonth in (201705, 201701)
group by id;

Upvotes: 1
