AmirCS

Reputation: 331

Appending files with different schemas - BigQuery

I'm trying to append multiple CSV files into one table in BigQuery. The problem is that they don't have the same schema: they share a key column (ID), but not the same fields. Here is an example:

file1.csv

ID      A       B       C
1       T       T       T
2       F       T       F
3       F       F       F

file2.csv

ID      A       B       D
1       T       T       T
4       F       T       F
5       F       F       F

What's the best way to merge these files into a single table like this?

ID      A       B       C       D
1       T       T       T       T
2       F       T       F
3       F       F       F
4       F       T               F
5       F       F               F

Upvotes: 3

Views: 1736

Answers (3)

Jordan Tigani

Reputation: 26637

BigQuery doesn't have the concept of keys, so if you "append" two files that share an ID column, their rows won't get merged. But if what you want is just to load two files with different schemas and append their data, you can do that pretty easily: tell BigQuery to do the load with a schema update option that allows schema changes, and pass the explicit schema of the files you're adding. So in your case:

If you have the source files:

$ cat one.csv
ID,A,B,C
1,T,T,T
2,F,T,F
3,F,F,F

$ cat two.csv
ID,A,B,D
1,T,T,T
4,F,T,F
5,F,F,F

you can then do

$ bq load --source_format=CSV --schema=id,a,b,c --skip_leading_rows=1 temp.test one.csv
Current status: DONE   
$ bq load --source_format=CSV --schema=id,a,b,d --schema_update_option=ALLOW_FIELD_ADDITION --skip_leading_rows=1 temp.test two.csv
Current status: DONE   
$ bq head temp.test
+----+---+---+------+------+
| id | a | b |  d   |  c   |
+----+---+---+------+------+
| 1  | T | T | NULL | T    |
| 2  | F | T | NULL | F    |
| 3  | F | F | NULL | F    |
| 1  | T | T | T    | NULL |
| 4  | F | T | F    | NULL |
| 5  | F | F | F    | NULL |
+----+---+---+------+------+

However, this isn't exactly what you said you wanted; it seems like you want to merge the row with ID 1 so it has the data from both files.

The best way to do this is to load the files into two separate tables and then join them. If you load into temp.t1 and temp.t2, you can do a FULL OUTER JOIN of the two tables, as in:

$ bq load --source_format=CSV --schema=id,a,b,c --skip_leading_rows=1 temp.t1 one.csv
Current status: DONE   
$ bq load --source_format=CSV --schema=id,a,b,d --skip_leading_rows=1 temp.t2 two.csv
Current status: DONE   
$ bq query --nouse_legacy_sql "SELECT IFNULL(t2.id, t1.id) as id, IFNULL(t2.a,  t1.a) as a, IFNULL(t2.b, t1.b) as b, t1.c as c, t2.d as d   FROM temp.t1 as t1 FULL OUTER JOIN temp.t2 as t2 ON t1.id = t2.id ORDER BY id"
Current status: DONE   
+----+---+---+------+------+
| id | a | b |  c   |  d   |
+----+---+---+------+------+
| 1  | T | T | T    | T    |
| 2  | F | T | F    | NULL |
| 3  | F | F | F    | NULL |
| 4  | F | T | NULL | F    |
| 5  | F | F | NULL | F    |
+----+---+---+------+------+
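
If you want to keep that merged result rather than just print it, one option (a sketch - temp.merged is a hypothetical table name) is to write the query output to a destination table:

$ bq query --nouse_legacy_sql --destination_table=temp.merged "SELECT IFNULL(t2.id, t1.id) AS id, IFNULL(t2.a, t1.a) AS a, IFNULL(t2.b, t1.b) AS b, t1.c AS c, t2.d AS d FROM temp.t1 AS t1 FULL OUTER JOIN temp.t2 AS t2 ON t1.id = t2.id"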

Upvotes: 3

Mikhail Berlyant

Reputation: 173191

Most likely you need this final table in order to join it to some other table(s). In that case, this pivot-like wide schema is not the most effective option anyway, so I would recommend considering the option below, where you flatten your original matrix into an ID - column - value schema.

In the example below I am assuming you have feature sets with TRUE / FALSE values, so I can easily reconcile "conflicting" values using logical AND or OR. The same approach will work if you actually have strings like "T" and "F"; in that case the code needs a slight adjustment (see the sketch after the query below).

The query below is for BigQuery Standard SQL; before you apply it, simply load all your files into separate tables (file1 > table1, file2 > table2, etc.):

#standardSQL
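-- x() takes the JSON string of one row and returns an array of
-- (column name, BOOL value) structs, one per true/false field in the row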
CREATE TEMP FUNCTION x(t STRING) AS ((
  ARRAY(SELECT AS STRUCT col, val = 'true' val FROM
  UNNEST(REGEXP_EXTRACT_ALL(t, r',"(.+?)":(?:true|false)')) col WITH OFFSET
  JOIN UNNEST(REGEXP_EXTRACT_ALL(t, r',".+?":(true|false)')) val WITH OFFSET
  USING(OFFSET))  
));
SELECT id, col, LOGICAL_OR(val) val
FROM (
  SELECT ID, col, val FROM `project.dataset.table1` t, UNNEST(x(TO_JSON_STRING(t))) 
  UNION ALL
  SELECT ID, col, val FROM `project.dataset.table2` t, UNNEST(x(TO_JSON_STRING(t))) 
)
GROUP BY id, col

You can add as many of the lines below as you need, one pair per additional table:

UNION ALL
SELECT ID, col, val FROM `project.dataset.tableX` t, UNNEST(x(TO_JSON_STRING(t))) 
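
If your columns actually hold the strings "T" / "F" rather than BOOLs, a minimal sketch of the adjusted function - only the regular expressions change, to match quoted values, and the comparison becomes val = 'T':

#standardSQL
CREATE TEMP FUNCTION x(t STRING) AS ((
  ARRAY(SELECT AS STRUCT col, val = 'T' val FROM
  UNNEST(REGEXP_EXTRACT_ALL(t, r',"(.+?)":"(?:T|F)"')) col WITH OFFSET
  JOIN UNNEST(REGEXP_EXTRACT_ALL(t, r',".+?":"(T|F)"')) val WITH OFFSET
  USING(OFFSET))
));

The rest of the query stays the same.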

You can test and play with the above using the sample data from your question, as in the example below:

#standardSQL
CREATE TEMP FUNCTION x(t STRING) AS ((
  ARRAY(SELECT AS STRUCT col, val = 'true' val FROM
  UNNEST(REGEXP_EXTRACT_ALL(t, r',"(.+?)":(?:true|false)')) col WITH OFFSET
  JOIN UNNEST(REGEXP_EXTRACT_ALL(t, r',".+?":(true|false)')) val WITH OFFSET
  USING(OFFSET))  
));
WITH `project.dataset.table1` AS (
  SELECT 1 ID, TRUE A, TRUE B, TRUE C UNION ALL
  SELECT 2, FALSE, TRUE, FALSE UNION ALL
  SELECT 3, FALSE, FALSE, FALSE 
), `project.dataset.table2` AS (
  SELECT 1 ID, TRUE A, TRUE B, TRUE D UNION ALL
  SELECT 4, FALSE, TRUE, FALSE UNION ALL
  SELECT 5, FALSE, FALSE, FALSE 
)
SELECT id, col, LOGICAL_OR(val) val
FROM (
  SELECT ID, col, val FROM `project.dataset.table1` t, UNNEST(x(TO_JSON_STRING(t))) 
  UNION ALL
  SELECT ID, col, val FROM `project.dataset.table2` t, UNNEST(x(TO_JSON_STRING(t))) 
)
GROUP BY id, col
-- ORDER BY id, col

with the result:

Row id  col val  
1   1   A   true     
2   1   B   true     
3   1   C   true     
4   1   D   true     
5   2   A   false    
6   2   B   true     
7   2   C   false    
8   3   A   false    
9   3   B   false    
10  3   C   false    
11  4   A   false    
12  4   B   true     
13  4   D   false    
14  5   A   false    
15  5   B   false    
16  5   D   false    

In my experience, in most cases the flattened schema above is simpler and easier to work with than the wide schema you initially expected (in your question).
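
And if you ever do need the wide shape back, you can pivot the flattened table with conditional aggregation - a sketch, where project.dataset.flattened is a hypothetical table holding the output above:

#standardSQL
SELECT id,
  LOGICAL_OR(IF(col = 'A', val, NULL)) A,
  LOGICAL_OR(IF(col = 'B', val, NULL)) B,
  LOGICAL_OR(IF(col = 'C', val, NULL)) C,
  LOGICAL_OR(IF(col = 'D', val, NULL)) D
FROM `project.dataset.flattened`
GROUP BY id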

Upvotes: 2

Felipe Hoffa

Reputation: 59325

When you load JSON files into BigQuery, it will happily identify each value and load it into the correct column, since every JSON record names the columns its data belongs to. With CSV you can't have the same: when you load CSV files into BigQuery, it simply maps the CSV columns to the table's columns in order.

Hence, if you have CSV files with different schemas, you will need to load them into different BigQuery tables and merge them later with an insert or a query.
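
For example, newline-delimited JSON versions of the two files from the question could go straight into one table - a minimal sketch (the .json file names are hypothetical; the schema update option from the first answer allows the new column):

$ bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect temp.test one.json
$ bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect --schema_update_option=ALLOW_FIELD_ADDITION temp.test two.json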

Upvotes: 1
