user374374

Reputation: 343

BigQuery full table to partition

I have 340 GB of data in one table (270 days' worth of data). I am now planning to move this data to a partitioned table.

That means I will have 270 partitions. What is the best way to move this data into a partitioned table?

I don't want to run 270 queries, which is a very costly operation, so I am looking for an optimized solution.

I have multiple tables like this, and I need to migrate all of them to partitioned tables.

Thanks,

Upvotes: 5

Views: 1379

Answers (3)

MassyB

Reputation: 1184

I will add a fourth option to @Mikhail's answer

DDL QUERY

  • Action = 1 query to run
  • Full scans = 1
  • Cost = $5 x 0.34 = $1.70 (270x cheaper than option #1 \o/)

With BigQuery's new DDL support (CREATE TABLE ... AS SELECT), you can convert a non-partitioned table into a partitioned one while doing only one full scan of the source table.

To illustrate my solution, I will use one of BQ's public tables, namely bigquery-public-data.hacker_news.comments. Below is the table's schema:

name    | type      | description
--------|-----------|------------
id      | INTEGER   | ...
by      | STRING    | ...
author  | STRING    | ...
...     |           |
time_ts | TIMESTAMP | human-readable timestamp in UTC, YYYY-MM-DD hh:mm:ss (this is the column we will partition by)
...     |           |

We are going to partition the comments table based on time_ts

#standardSQL
CREATE TABLE my_dataset.comments_partitioned
PARTITION BY DATE(time_ts)
AS 
SELECT *
FROM `bigquery-public-data.hacker_news.comments`
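
Since the question mentions several tables to migrate, the same DDL can be scripted with whatever client you prefer. Below is a minimal sketch using the Python client library (google-cloud-bigquery); the dataset, table and timestamp-column names are placeholders you would replace with your own:

# Sketch only -- dataset / table / column names below are placeholders
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials; standard SQL is the client default

# (source table, timestamp column to partition on) pairs -- hypothetical names
tables_to_convert = [
    ("my_dataset.events", "event_ts"),
    ("my_dataset.clicks", "click_ts"),
]

for table, ts_column in tables_to_convert:
    ddl = f"""
    CREATE TABLE `{table}_partitioned`
    PARTITION BY DATE({ts_column})
    AS SELECT * FROM `{table}`
    """
    client.query(ddl).result()  # one DDL statement = one full scan per source table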

I hope it helps :)

Upvotes: 3

Mikhail Berlyant

Reputation: 172944

I see three options

  1. Direct Extraction out of original table (see the per-day query sketch after the summary below):
    Actions (how many queries to run) = Days [to extract] = 270
    Full Scans (how much data is scanned, measured in full scans of the original table) = Days = 270
    Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 270 = $459.00

  2. Hierarchical (recursive) Extraction (described in Mosha’s answer):
    Actions = 2^log2(Days) – 2 = 510
    Full Scans = 2*log2(Days) = 18
    Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 18 = $30.60

  3. Clustered Extraction (I will describe it in a sec):
    Actions = Days + 1 = 271
    Full Scans = [always] 2
    Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 2 = $3.40

Summary

Method                                  Actions    Full Scans    Total Cost
Direct Extraction                           270           270       $459.00
Hierarchical (recursive) Extraction         510            18        $30.60
Clustered Extraction                        271             2         $3.40
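
For reference, option #1 (Direct Extraction) boils down to one query per day, each one scanning the full source table and writing its output into the matching partition. A rough sketch with the Python client, assuming the target partitioned table is pre-created; the project, table and column names, day count and start date are placeholders:

# Sketch only -- project / table / column names, day count and start date are placeholders
import datetime
from google.cloud import bigquery

client = bigquery.Client()
start = datetime.date(2017, 1, 1)  # first day of data (assumption)

for n in range(270):  # one query and one full scan per day
    day = start + datetime.timedelta(days=n)
    job_config = bigquery.QueryJobConfig(
        # write into the day's partition of the pre-created target table via the $YYYYMMDD decorator
        destination=f"my-project.my_dataset.my_table_partitioned${day:%Y%m%d}",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    sql = f"""
    SELECT *
    FROM `my-project.my_dataset.my_table`
    WHERE DATE(time_ts) = DATE '{day.isoformat()}'
    """
    client.query(sql, job_config=job_config).result()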

Definitely, for most practical purposes Mosha’s solution is the way to go (I use it in most such cases).
It is relatively simple and straightforward.

Even though you need to run the query 510 times – the query is "relatively" simple and the orchestration logic is easy to implement with whatever client you usually use.
And the cost savings are quite visible! From $459 down to $31!
Almost 15 times less!

In case you:
  a) want to lower the cost even further, by another factor of 9 (so 135 times lower in total), and
  b) like having fun and more challenges – take a look at the third option.

“Clustered Extraction” Explanation

Idea / Goal:
Step 1
We want to transform the original table into another [single] table with 270 columns – one column per day.
Each column will hold the serialized rows of the respective day from the original table (one serialized row per cell).
The total number of rows in this new table will be equal to the number of rows of the most "heavy" day.
This will require just one query (see example below) with one full scan.

Step 2
After this new table is ready, we will extract data day by day, querying ONLY the respective column, and write it into the final daily table (the schema of the daily tables is the very same as the original table's schema, and all those tables can be pre-created).
This will require 270 queries to be run, with scans approximately equivalent (this really depends on how complex your schema is, so it can vary) to one full scan of the original table.
While querying a column, we will need to de-serialize the row's value and parse it back into the original schema.

Very simplified example: (using BigQuery Standard SQL here)

The purpose of this example is just to give direction, in case you find the idea interesting.
Serialization / de-serialization is extremely simplified to keep the focus on the idea rather than on a particular implementation, which can differ from case to case (it mostly depends on the schema).

So, assume the original table (theTable) looks something like below:

  SELECT  1 AS id, "101" AS x, 1 AS ts UNION ALL
  SELECT  2 AS id, "102" AS x, 1 AS ts UNION ALL
  SELECT  3 AS id, "103" AS x, 1 AS ts UNION ALL
  SELECT  4 AS id, "104" AS x, 1 AS ts UNION ALL
  SELECT  5 AS id, "105" AS x, 1 AS ts UNION ALL
  SELECT  6 AS id, "106" AS x, 2 AS ts UNION ALL
  SELECT  7 AS id, "107" AS x, 2 AS ts UNION ALL
  SELECT  8 AS id, "108" AS x, 2 AS ts UNION ALL
  SELECT  9 AS id, "109" AS x, 2 AS ts UNION ALL
  SELECT 10 AS id, "110" AS x, 3 AS ts UNION ALL
  SELECT 11 AS id, "111" AS x, 3 AS ts UNION ALL
  SELECT 12 AS id, "112" AS x, 3 AS ts UNION ALL
  SELECT 13 AS id, "113" AS x, 3 AS ts UNION ALL
  SELECT 14 AS id, "114" AS x, 3 AS ts UNION ALL
  SELECT 15 AS id, "115" AS x, 3 AS ts UNION ALL
  SELECT 16 AS id, "116" AS x, 3 AS ts UNION ALL
  SELECT 17 AS id, "117" AS x, 3 AS ts UNION ALL
  SELECT 18 AS id, "118" AS x, 3 AS ts UNION ALL
  SELECT 19 AS id, "119" AS x, 4 AS ts UNION ALL
  SELECT 20 AS id, "120" AS x, 4 AS ts

Step 1 – transform the table and write the result into tempTable

SELECT 
  num, 
  MAX(IF(ts=1, ser, NULL)) AS ts_1, 
  MAX(IF(ts=2, ser, NULL)) AS ts_2, 
  MAX(IF(ts=3, ser, NULL)) AS ts_3, 
  MAX(IF(ts=4, ser, NULL)) AS ts_4
FROM (
  SELECT 
    ts, 
    CONCAT(CAST(id AS STRING), "|", x, "|", CAST(ts AS STRING)) AS ser, 
    ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num
  FROM theTable
)
GROUP BY num

tempTable will look like below:

num    ts_1        ts_2         ts_3        ts_4     
1   1|101|1     6|106|2     10|110|3    19|119|4     
2   2|102|1     7|107|2     11|111|3    20|120|4     
3   3|103|1     8|108|2     12|112|3        null     
4   4|104|1     9|109|2     13|113|3        null     
5   5|105|1        null     14|114|3        null     
6      null        null     15|115|3        null     
7      null        null     16|116|3        null     
8      null        null     17|117|3        null     
9      null        null     18|118|3        null    

Here, I am using simple concatenation for serialization

Step 2 – extract rows for a specific day and write the output to the respective daily table
Please note: in the example below we are extracting rows for ts = 2, which corresponds to column ts_2

SELECT
  r[OFFSET(0)] AS id,
  r[OFFSET(1)] AS x,
  r[OFFSET(2)] AS ts
FROM (
  SELECT SPLIT(ts_2, "|") AS r 
  FROM tempTable 
  WHERE NOT ts_2 IS NULL
)

The result will look like below (which is expected):

id    x    ts    
6   106     2    
7   107     2    
8   108     2    
9   109     2   
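
Orchestrating Step 2 over all day columns could look something like the sketch below (Python client; the table names, day count, start date and the CASTs back to the original types are placeholders/assumptions matching the simplified serialization above):

# Sketch only -- table names, day count, start date and target schema are placeholders
import datetime
from google.cloud import bigquery

client = bigquery.Client()
start = datetime.date(2017, 1, 1)  # day that corresponds to column ts_1 (assumption)

for n in range(1, 271):  # columns ts_1 .. ts_270; each query scans only one column of tempTable
    day = start + datetime.timedelta(days=n - 1)
    job_config = bigquery.QueryJobConfig(
        destination=f"my-project.my_dataset.daily_{day:%Y%m%d}",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    sql = f"""
    SELECT
      CAST(r[OFFSET(0)] AS INT64) AS id,   -- de-serialize back to the original schema
      r[OFFSET(1)] AS x,
      CAST(r[OFFSET(2)] AS INT64) AS ts
    FROM (
      SELECT SPLIT(ts_{n}, "|") AS r
      FROM `my-project.my_dataset.tempTable`
      WHERE ts_{n} IS NOT NULL
    )
    """
    client.query(sql, job_config=job_config).result()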

I wish I had more time to write this down properly, so don't judge too harshly if something is missing – this is more of a directional answer. At the same time, the example is pretty reasonable, and if you have a plain, simple schema, almost no extra thinking is required. Of course, with records and nested fields in the schema, the most challenging part is the serialization / de-serialization – but that's where the fun is, along with the extra $ savings.

Upvotes: 5

Mosha Pasumansky

Reputation: 13994

If your data was in sharded tables (i.e. with a YYYYmmdd suffix), you could have used the "bq partition" command. But with the data in a single table, you will have to scan it multiple times, applying different WHERE clauses on your partition key column. The only optimization I can think of is to do it hierarchically, i.e. instead of 270 queries which would do 270 full table scans – first split the table in half, then each half in half again, and so on. This way you will need to pay for 2*log_2(270) = 2*9 = 18 full scans.
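
A rough sketch of that recursion with the Python client (all project/table/column names and the date range are placeholders): each level halves the date range, and the leaves end up as single-day tables with a YYYYMMDD suffix, which the "bq partition" command mentioned above can then fold into a partitioned table:

# Sketch only -- project / dataset / table / column names and date range are placeholders
import datetime
from google.cloud import bigquery

client = bigquery.Client()
DAYS = [datetime.date(2017, 1, 1) + datetime.timedelta(days=n) for n in range(270)]

def split(source_table, days):
    """Recursively halve the date range, copying each half of source_table into a smaller temp table."""
    if len(days) == 1:
        return  # leaf reached: source_table already holds exactly one day
    mid = len(days) // 2
    for half in (days[:mid], days[mid:]):
        if len(half) == 1:
            dest = f"my-project.my_dataset.sharded_{half[0]:%Y%m%d}"  # single-day leaf, ready for "bq partition"
        else:
            dest = f"my-project.my_dataset.temp_{half[0]:%Y%m%d}_{half[-1]:%Y%m%d}"
        job_config = bigquery.QueryJobConfig(
            destination=dest,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        )
        sql = f"""
        SELECT *
        FROM `{source_table}`
        WHERE DATE(time_ts) BETWEEN DATE '{half[0]}' AND DATE '{half[-1]}'
        """
        client.query(sql, job_config=job_config).result()
        split(dest, half)

split("my-project.my_dataset.my_table", DAYS)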

Once the conversion is done - all the temporary tables can be deleted to eliminate extra storage costs.

Upvotes: 1
