Rutger Hofste
Rutger Hofste

Reputation: 4373

Get all first elements of list/array in BigQuery

I have a large number of .csv files with the following cell values:

"[[0.0, 4.0],  .... , [240.0, 0.0], [248.0, 0.0]]"

The string contains a nested list and is a result of a histogram reducer with 32 bins for 8bit data and contains the lower bin value and the count.

For instance, the first element contains the lower bin value of the 1st bin (0.0) and the count (4.0). The last element contains the lower bin value of the 32nd bin (248.0) and count (0.0).

Since the lower bin values do not change and are known [0,8,16 ... 248], I would like to extract only the counts i.e.

[4, .... ,  0 ]

In Python, this would be straight forward, however the amount of data is quite big and I have 3,422,250 of these histograms. Therefore I considered using Google BigQuery to get the job done.

When I load the cvs data in BigQuery, the histograms are stored as type STRING.

How can I get nested lists (arrays) that are stored as string in csv, in the ARRAY datatype in BigQuery? In the documentation, it says that nested arrays are not yet supported. Are there workarounds?

guidance on how to get the first element of multiple arrays is very welcome too!

p.s. I already tried to solve the problem upstream to no avail.

Example csv file

Upvotes: 1

Views: 2982

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Not sure if it is exactly what you are asking, but hope below example (for BigQuery Standard SQL) will help you

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id,'[[0.0, 4.0], [8.0, 0.0], [16.0, 0.0], [24.0, 0.0], [32.0, 0.0], [40.0, 0.0], [48.0, 0.0], [56.0, 0.0], [64.0, 1.0], [72.0, 1.0], [80.0, 4.0], [88.0, 0.0], [96.0, 0.0], [104.0, 0.0], [112.0, 0.0], [120.0, 0.0], [128.0, 0.0], [136.0, 0.0], [144.0, 0.0], [152.0, 0.0], [160.0, 0.0], [168.0, 0.0], [176.0, 0.0], [184.0, 0.0], [192.0, 0.0], [200.0, 0.0], [208.0, 0.0], [216.0, 0.0], [224.0, 0.0], [232.0, 0.0], [240.0, 0.0], [248.0, 0.0]]' histogram UNION ALL
  SELECT 2, '[[0.0, 0.0], [8.0, 0.0], [16.0, 0.0], [24.0, 0.0], [32.0, 0.0], [40.0, 0.0], [48.0, 0.0], [56.0, 0.0], [64.0, 0.0], [72.0, 0.0], [80.0, 0.0], [88.0, 0.0], [96.0, 0.0], [104.0, 0.0], [112.0, 1.0], [120.0, 0.0], [128.0, 1.0], [136.0, 0.0], [144.0, 0.0], [152.0, 0.0], [160.0, 0.0], [168.0, 0.0], [176.0, 0.0], [184.0, 0.0], [192.0, 0.0], [200.0, 0.0], [208.0, 0.0], [216.0, 0.0], [224.0, 0.0], [232.0, 0.0], [240.0, 0.0], [248.0, 0.0]]'
)
SELECT id, 
  SPLIT(bin)[OFFSET(0)] value,
  SPLIT(bin)[OFFSET(1)] frequency
FROM `project.dataset.table`, UNNEST(SPLIT(REGEXP_REPLACE(histogram, r'\[\[|]]|\s', ''), '],[')) bin    

Note: this assumes When I load the cvs data in BigQuery, the histograms are stored as type STRING as

"[[0.0, 4.0],  .... , [240.0, 0.0], [248.0, 0.0]]"     

OR - if you want to keep rows intact and have histogram presented as string to be transformed into array - you can try below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id,'[[0.0, 4.0], [8.0, 0.0], [16.0, 0.0], [24.0, 0.0], [32.0, 0.0], [40.0, 0.0], [48.0, 0.0], [56.0, 0.0], [64.0, 1.0], [72.0, 1.0], [80.0, 4.0], [88.0, 0.0], [96.0, 0.0], [104.0, 0.0], [112.0, 0.0], [120.0, 0.0], [128.0, 0.0], [136.0, 0.0], [144.0, 0.0], [152.0, 0.0], [160.0, 0.0], [168.0, 0.0], [176.0, 0.0], [184.0, 0.0], [192.0, 0.0], [200.0, 0.0], [208.0, 0.0], [216.0, 0.0], [224.0, 0.0], [232.0, 0.0], [240.0, 0.0], [248.0, 0.0]]' histogram UNION ALL
  SELECT 2, '[[0.0, 0.0], [8.0, 0.0], [16.0, 0.0], [24.0, 0.0], [32.0, 0.0], [40.0, 0.0], [48.0, 0.0], [56.0, 0.0], [64.0, 0.0], [72.0, 0.0], [80.0, 0.0], [88.0, 0.0], [96.0, 0.0], [104.0, 0.0], [112.0, 1.0], [120.0, 0.0], [128.0, 1.0], [136.0, 0.0], [144.0, 0.0], [152.0, 0.0], [160.0, 0.0], [168.0, 0.0], [176.0, 0.0], [184.0, 0.0], [192.0, 0.0], [200.0, 0.0], [208.0, 0.0], [216.0, 0.0], [224.0, 0.0], [232.0, 0.0], [240.0, 0.0], [248.0, 0.0]]'
)
SELECT id, 
  ARRAY(
    SELECT AS STRUCT
      SPLIT(bin)[OFFSET(0)] value,
      SPLIT(bin)[OFFSET(1)] frequency
    FROM UNNEST(SPLIT(REGEXP_REPLACE(histogram, r'\[\[|]]|\s', ''), '],[')) bin
  ) histogram_as_array
FROM `project.dataset.table`

Upvotes: 3

Related Questions