Roman
Roman

Reputation: 9441

Bigquery conditional split aggregations fail

I have a category column with a string containing subcategory fields in variable positions separated by a "|". The location of each subcategory is dependent on the number of items in the string. For example:

category    subcat1    subcat2    subcat3
a|b|c       b          c          a
x|y|a|b     b          null       a

So, to solve for one category, I have:

SELECT
  a.category AS category,
  case
    WHEN COUNT(SPLIT(a.category, "|")) = 4 then nth(4, SPLIT(a.category, "|"))
    WHEN COUNT(SPLIT(a.category, "|")) = 3 then nth(2, SPLIT(a.category, "|"))
    WHEN COUNT(SPLIT(a.category, "|")) = 2 then nth(2, SPLIT(a.category, "|"))
    else null
  end as subcat1,
  --nth(2, SPLIT(a.category, "|")) as x      --uncomment for success.  see below
FROM
  [interim_groups.articles_unique] as a

Running this fails with:

SELECT clause has mix of aggregations 'subcat1' and fields 'category' without GROUP BY clause

Now I don't want a group by clause and it doesn't make sense to have one, but if I include it it starts complaining scoped aggregations, which seems to be going in the wrong direction.

The same occurs if I use an if statement instead of a case statement.

Now here is the strange bit. If I have the commented line (or alternativelylast(SPLIT(a.category, "|")) as x) in my query, the query passes flawlessly.

Is this a bug? My query looks right, and having an extra column in my query somehow making it pass is strange.
Is there a better way to fix this than just leaving in an unnecessary column to stabilize the query?

Upvotes: 3

Views: 779

Answers (2)

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

The query is missing the 'WITHIN RECORD' keyword.

SELECT
  a.category AS category,
  case
    WHEN COUNT(SPLIT(a.category, "|")) = 4 then nth(4, SPLIT(a.category, "|"))
    WHEN COUNT(SPLIT(a.category, "|")) = 3 then nth(2, SPLIT(a.category, "|"))
    WHEN COUNT(SPLIT(a.category, "|")) = 2 then nth(2, SPLIT(a.category, "|"))
    else null
  end WITHIN RECORD as subcat1 ,
FROM (SELECT category FROM
  (SELECT 'a|b|c' category), (SELECT 'a|b' category), (SELECT 'a|b|c|d' category)) a

See the documentation for WITHIN:

The WITHIN keyword specifically works with aggregate functions to aggregate across children and repeated fields within records and nested fields. When you specify the WITHIN keyword, you need to specify the scope over which you want to aggregate - WITHIN RECORD: Aggregates data in the repeated values within the record.

https://cloud.google.com/bigquery/docs/data#within

Upvotes: 5

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

I think using String functions you can relatively easily extract last component in pipe separated string

Meantime, below version with use of Regular expression functions - it is a little more powerfull for potentially more complex scenarious

SELECT
  a.category AS category,
  CASE
    WHEN LENGTH(REGEXP_REPLACE(a.category, r"[^|]", "")) = 4 
           THEN REGEXP_EXTRACT(a.category, r'\|\w+\|\w+\|\w+\|(\w+)')
    WHEN LENGTH(REGEXP_REPLACE(a.category, r"[^|]", "")) = 3 
           THEN REGEXP_EXTRACT(a.category, r'\|\w+\|\w+\|(\w+)')
    WHEN LENGTH(REGEXP_REPLACE(a.category, r"[^|]", "")) = 2 
           THEN REGEXP_EXTRACT(a.category, r'\|\w+\|(\w+)')
    ELSE NULL
  END AS subcat1
FROM your_table a

No grouping, no unnecessary column as asked in question!

Upvotes: 2

Related Questions