Tuan Vu

Reputation: 716

Deal with null values from BigQuery in Google Cloud Dataflow Python

The issue I have is that when I read data from GBQ that contains null values, mapping any function over a column with nulls raises errors.

When I write the input_data from GBQ to text, the JSON output file does not contain the keys whose values are null. I believe this is an issue that needs to be fixed.

For example, given this input:

key_1,key_2,key_3
value_1,,value_3

the output JSON is missing key_2 entirely.
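A minimal sketch of the failure mode (the row dict and function below are illustrative, not from the question): the read emits each row as a Python dict in which a NULL column is simply absent, so a function mapped over that column fails with a KeyError.

# Illustrative only: a row as emitted by the BigQuery read when key_2 is NULL.
# The column is absent from the dict rather than present with value None.
row = {'key_1': 'value_1', 'key_3': 'value_3'}

def upper_key_2(row):
  # Raises KeyError('key_2') because the key was dropped, not set to None.
  return row['key_2'].upper()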

Upvotes: 2

Views: 1292

Answers (1)

Silviu

Reputation: 231

For now, there is not much we can do at the Dataflow level. As you pointed out, the JSON coming out of BigQuery does not include the null values. This will improve (though not in the next immediate release) if we switch to Avro as an intermediate format for exports. In the meantime, you can insert a one-liner function to "clean up" the data by adding back the missing nullable fields. See the example below:

def add_null_field(row, field):
  # Re-insert a nullable field that the BigQuery JSON export omitted,
  # defaulting it to None so downstream transforms can rely on the key.
  row.update({field: row.get(field, None)})
  return row

import google.cloud.dataflow as df  # import for the pre-Beam Dataflow SDK this answer targets

(p
  | df.io.Read(df.io.BigQuerySource('PROJECT:DATASET.TABLE'))
  | df.Map(add_null_field, field='value')
  | df.io.Write(df.io.TextFileSink('gs://BUCKET/FILES')))
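If several columns are nullable, the same idea extends to a list of fields; the helper and field names below are an illustrative generalization, not part of the original one-liner:

def add_null_fields(row, fields):
  # Back-fill every listed nullable column that the export dropped.
  for field in fields:
    row.setdefault(field, None)
  return row

You would then map it with the fields passed as a keyword argument, e.g. df.Map(add_null_fields, fields=['value', 'other_value']).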

Hope this helps.

Upvotes: 3
