I'm trying to append multiple CSV files to one table in BigQuery. The problem is they don't have the same schema: they share a key column, but not the same fields. Here is an example:
file1.csv
ID A B C
1 T T T
2 F T F
3 F F F
file2.csv
ID A B D
1 T T T
4 F T F
5 F F F
What's the best way to merge these files into a single table like this?
ID A B C D
1 T T T T
2 F T F
3 F F F
4 F T F
5 F F F
Upvotes: 3
BigQuery doesn't have the concept of keys, so if you "append" two files that have the same ID column, the rows won't get merged. But if all you want is to load two files with different schemas and append their data, you can do this pretty easily: tell BigQuery to do the load with a schema update option that allows schema changes. You should also pass the explicit schema of the files you're adding. So in your case:
If you have the source files:
$ cat one.csv
ID,A,B,C
1,T,T,T
2,F,T,F
3,F,F,F
$ cat two.csv
ID,A,B,D
1,T,T,T
4,F,T,F
5,F,F,F
you can then do
$ bq load --source_format=CSV --schema=id,a,b,c --skip_leading_rows=1 temp.test one.csv
Current status: DONE
$ bq load --source_format=CSV --schema=id,a,b,d --schema_update_option=ALLOW_FIELD_ADDITION --skip_leading_rows=1 temp.test two.csv
Current status: DONE
$ bq head temp.test
+----+---+---+------+------+
| id | a | b | d | c |
+----+---+---+------+------+
| 1 | T | T | NULL | T |
| 2 | F | T | NULL | F |
| 3 | F | F | NULL | F |
| 1 | T | T | T | NULL |
| 4 | F | T | F | NULL |
| 5 | F | F | F | NULL |
+----+---+---+------+------+
However, this isn't exactly what you said you wanted: it seems like you want the row with ID 1 merged so it has the data from both files.
The best way to do that is to load the files into two separate tables, then do a join. If you load them into temp.t1 and temp.t2, you can then do a FULL OUTER JOIN of the two tables. As in
$ bq load --source_format=CSV --schema=id,a,b,c --skip_leading_rows=1 temp.t1 one.csv
Current status: DONE
$ bq load --source_format=CSV --schema=id,a,b,d --skip_leading_rows=1 temp.t2 two.csv
Current status: DONE
$ bq query --nouse_legacy_sql "SELECT IFNULL(t2.id, t1.id) as id, IFNULL(t2.a, t1.a) as a, IFNULL(t2.b, t1.b) as b, t1.c as c, t2.d as d FROM temp.t1 as t1 FULL OUTER JOIN temp.t2 as t2 ON t1.id = t2.id ORDER BY id"
Current status: DONE
+----+---+---+------+------+
| id | a | b | c | d |
+----+---+---+------+------+
| 1 | T | T | T | T |
| 2 | F | T | F | NULL |
| 3 | F | F | F | NULL |
| 4 | F | T | NULL | F |
| 5 | F | F | NULL | F |
+----+---+---+------+------+
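The IFNULL + FULL OUTER JOIN logic in that query can be sketched locally in plain Python, which may help when sanity-checking the expected output before running the bq query. This is only an illustrative stand-in (the CSV contents are inlined as strings; bq itself is not involved):

```python
import csv
import io

# In-memory stand-ins for one.csv and two.csv from the answer above.
one_csv = "ID,A,B,C\n1,T,T,T\n2,F,T,F\n3,F,F,F\n"
two_csv = "ID,A,B,D\n1,T,T,T\n4,F,T,F\n5,F,F,F\n"

def rows_by_id(text):
    """Index each CSV's rows by the ID column."""
    return {r["ID"]: r for r in csv.DictReader(io.StringIO(text))}

t1, t2 = rows_by_id(one_csv), rows_by_id(two_csv)

merged = {}
# The union of IDs mimics the FULL OUTER JOIN: every ID from either side.
for _id in sorted(t1.keys() | t2.keys(), key=int):
    r1, r2 = t1.get(_id, {}), t2.get(_id, {})
    merged[_id] = {
        "id": _id,
        # IFNULL(t2.x, t1.x): prefer table 2's value when present.
        "a": r2.get("A", r1.get("A")),
        "b": r2.get("B", r1.get("B")),
        "c": r1.get("C"),  # C only exists in table 1
        "d": r2.get("D"),  # D only exists in table 2
    }
```

Row 1 ends up with values from both files, while rows 2-3 get NULL (None) for d and rows 4-5 get NULL for c, matching the query result above.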
Upvotes: 3
Most likely you need this final table to then join against some other table(s). In that case, having this pivot-like schema is not the most effective option anyway, so I would recommend considering the option below, where you flatten your original matrix into an ID - column - value schema.
In the example below I am assuming you have feature sets with True / False values, so I can easily reconcile "conflicting" values using logical AND or OR. The same approach will work if you actually have strings like "T", "F" (of course, in that case the code below will need to be adjusted slightly).
So, the below is for BigQuery Standard SQL. Before you apply this code, simply load all your files into separate tables (file1 > table1, file2 > table2, etc.)
#standardSQL
CREATE TEMP FUNCTION x(t STRING) AS ((
ARRAY(SELECT AS STRUCT col, val = 'true' val FROM
UNNEST(REGEXP_EXTRACT_ALL(t, r',"(.+?)":(?:true|false)')) col WITH OFFSET
JOIN UNNEST(REGEXP_EXTRACT_ALL(t, r',".+?":(true|false)')) val WITH OFFSET
USING(OFFSET))
));
SELECT id, col, LOGICAL_OR(val) val
FROM (
SELECT ID, col, val FROM `project.dataset.table1` t, UNNEST(x(TO_JSON_STRING(t)))
UNION ALL
SELECT ID, col, val FROM `project.dataset.table2` t, UNNEST(x(TO_JSON_STRING(t)))
)
GROUP BY id, col
You can add as many of the lines below as you need:
UNION ALL
SELECT ID, col, val FROM `project.dataset.tableX` t, UNNEST(x(TO_JSON_STRING(t)))
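If you have many files, the repetitive UNION ALL body can be generated rather than typed by hand. A small sketch (the table names are placeholders, and the SELECT template is just the line from the query above):

```python
# Placeholder table names; in practice, one per loaded CSV file.
tables = [
    "project.dataset.table1",
    "project.dataset.table2",
    "project.dataset.table3",
]

# Template copied from the query body above.
select = "SELECT ID, col, val FROM `{}` t, UNNEST(x(TO_JSON_STRING(t)))"

union_body = "\nUNION ALL\n".join(select.format(t) for t in tables)
query = (
    "SELECT id, col, LOGICAL_OR(val) val\nFROM (\n"
    + union_body
    + "\n)\nGROUP BY id, col"
)
```

The resulting string (with the temp function definition prepended) can be passed to bq query just like the hand-written version.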
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
CREATE TEMP FUNCTION x(t STRING) AS ((
ARRAY(SELECT AS STRUCT col, val = 'true' val FROM
UNNEST(REGEXP_EXTRACT_ALL(t, r',"(.+?)":(?:true|false)')) col WITH OFFSET
JOIN UNNEST(REGEXP_EXTRACT_ALL(t, r',".+?":(true|false)')) val WITH OFFSET
USING(OFFSET))
));
WITH `project.dataset.table1` AS (
SELECT 1 ID, TRUE A, TRUE B, TRUE C UNION ALL
SELECT 2, FALSE, TRUE, FALSE UNION ALL
SELECT 3, FALSE, FALSE, FALSE
), `project.dataset.table2` AS (
SELECT 1 ID, TRUE A, TRUE B, TRUE D UNION ALL
SELECT 4, FALSE, TRUE, FALSE UNION ALL
SELECT 5, FALSE, FALSE, FALSE
)
SELECT id, col, LOGICAL_OR(val) val
FROM (
SELECT ID, col, val FROM `project.dataset.table1` t, UNNEST(x(TO_JSON_STRING(t)))
UNION ALL
SELECT ID, col, val FROM `project.dataset.table2` t, UNNEST(x(TO_JSON_STRING(t)))
)
GROUP BY id, col
-- ORDER BY id, col
with the result:
Row id col val
1 1 A true
2 1 B true
3 1 C true
4 1 D true
5 2 A false
6 2 B true
7 2 C false
8 3 A false
9 3 B false
10 3 C false
11 4 A false
12 4 B true
13 4 D false
14 5 A false
15 5 B false
16 5 D false
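What the temp function and GROUP BY are doing can be sketched in plain Python: unpivot each row into (id, col, val) triples, then reconcile duplicates with OR, just like LOGICAL_OR(val) in the query. The rows below are illustrative copies of the sample data, not anything BigQuery-specific:

```python
from collections import defaultdict

# Sample rows mirroring table1 and table2 from the question.
table1 = [
    {"ID": 1, "A": True,  "B": True,  "C": True},
    {"ID": 2, "A": False, "B": True,  "C": False},
    {"ID": 3, "A": False, "B": False, "C": False},
]
table2 = [
    {"ID": 1, "A": True,  "B": True,  "D": True},
    {"ID": 4, "A": False, "B": True,  "D": False},
    {"ID": 5, "A": False, "B": False, "D": False},
]

# Unpivot every non-ID column into (id, col) -> val, OR-ing duplicates,
# which mirrors UNNEST(x(TO_JSON_STRING(t))) + GROUP BY id, col.
acc = defaultdict(bool)
for row in table1 + table2:
    for col, val in row.items():
        if col != "ID":
            acc[(row["ID"], col)] = acc[(row["ID"], col)] or val

flat = sorted((i, c, v) for (i, c), v in acc.items())
```

This yields the same 16 (id, col, val) rows shown in the result above, with ID 1's values reconciled from both tables.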
From my experience, in most cases using the above flattened schema is simpler and easier to work with than the schema you initially expected in your question.
Upvotes: 2
When you load JSON files into BigQuery, it will happily identify and load the data into the correct columns, as each JSON record clearly indicates which columns its data belongs to. With CSV you can't have that: when you load CSV files into BigQuery, it simply maps the CSV columns to the table columns by position.
Hence, if you have CSV files with different schemas, you will need to load them into different BigQuery tables and merge them later with an insert or a join.
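Following that observation, one workaround is to convert each CSV to newline-delimited JSON before loading, so every record names its own columns and BigQuery can match by field name instead of position. A minimal sketch (the CSV content is inlined here; in practice you would read it from a file):

```python
import csv
import io
import json

# Inlined copy of file2.csv for illustration.
two_csv = "ID,A,B,D\n1,T,T,T\n4,F,T,F\n5,F,F,F\n"

# Each output line is a self-describing JSON record, suitable for
# bq load --source_format=NEWLINE_DELIMITED_JSON.
ndjson = "\n".join(json.dumps(row) for row in csv.DictReader(io.StringIO(two_csv)))
```

Note that all values come through as strings here; if the target columns are BOOL or INT64, you would convert them during this step.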
Upvotes: 1