Reputation: 716
The issue I have is that when I read data from GBQ that contains null values, mapping any function over a column with nulls raises errors.
When I write the input_data from GBQ to text, the JSON output file does not contain the keys whose values are null. I believe this is an issue that needs to be fixed.
For example:
Input:
key_1,key_2,key_3
value_1,,value_3
Expected output:
{"key_1":"value_1","key_2":null,"key_3":"value_3"}
Output from Dataflow:
{"key_1":"value_1","key_3":"value_3"}
Upvotes: 2
Views: 1292
Reputation: 231
For now, there is not much we can do at the Dataflow level. As you pointed out, the JSON coming out of BigQuery does not include the null values. This will be improved (but not in the next immediate release) if we switch to AVRO as an intermediary format for exports. In the meantime, you can insert a one-liner function to "clean up" the data by adding the missing nullable fields. See the example below:
# Assumes the Dataflow Python SDK is imported as df and p is an existing pipeline,
# e.g. import google.cloud.dataflow as df; p = df.Pipeline(...)
def add_null_field(row, field):
    # Re-insert the field with None if BigQuery dropped it from the row.
    row.update({field: row.get(field, None)})
    return row

(p
 | df.io.Read(df.io.BigQuerySource('PROJECT:DATASET.TABLE'))
 | df.Map(add_null_field, field='value')
 | df.io.Write(df.io.TextFileSink('gs://BUCKET/FILES')))
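If you also want the written files to be valid JSON (with null instead of Python's None) and to cover every column at once, a small variation of the same idea is sketched below. This is only a sketch: the field list is assumed from the example in your question, and json.dumps is used because it serializes None as null, matching your expected output.

import json

EXPECTED_FIELDS = ['key_1', 'key_2', 'key_3']  # assumed from the question's example

def to_json_with_nulls(row, fields=EXPECTED_FIELDS):
    # Fill in any missing nullable fields as None, then serialize;
    # json.dumps writes None as null.
    return json.dumps({f: row.get(f, None) for f in fields})

This could replace the Map step above, e.g. | df.Map(to_json_with_nulls).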
Hope this helps.
Upvotes: 3