rens

Reputation: 43

Loading multiple files with different schemas to BigQuery using Dataflow

I have a set of .txt files with data in JSON format and varying schemas, all of them to be loaded into a common BigQuery table. The file format is simple: one set of files has 4 columns, some have 5, and some have the same 5 but in a different order...

E.g.: File in Folder 1 '{"name":"John", "age":31, "city":"New York"}'

File in Folder 2 '{"name":"Tom", "age":32, "city":"New York", "zip":"12345"}'

File in Folder 3 '{"name":"Janice", "age":31, "zip":"12345", "city":"New York"}'

File in Folder 4 '{"name":"Alice", "age":34, "zip":"12348"}'

File in Folder 5 '{"name":"Rob", "age":31, "zip":"12345", "Phone":"1234567891"}'

The BQ table will have the columns name, age, city, zip, and phone.

Each type of file is in a separate folder in GCS, and the files in each folder are consistent: within a GCS bucket subfolder, all files are of the same type, with the same number and order of columns.

I am able to load these files individually into corresponding BQ tables, but I need them in a single table, loading NULL for the 'not applicable' columns.
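For example, after loading, the five sample records above would look like this in the target table:

name     age  city      zip    phone
John     31   New York  NULL   NULL
Tom      32   New York  12345  NULL
Janice   31   New York  12345  NULL
Alice    34   NULL      12348  NULL
Rob      31   NULL      12345  1234567891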

Seeking advice on loading the data to BQ using Dataflow as a batch process. I am pretty new to this; appreciate your help.

Upvotes: 0

Views: 827

Answers (1)

Peter Kim

Reputation: 1977

Cloud Storage Text to BigQuery, a Google-provided template, can be used for your use case.

As described in the documentation, please create a BigQuery schema file that is the union of all known schemas, and a JavaScript UDF file to handle any missing fields (BigQuery may be able to automatically replace missing values with nulls, but I haven't tested that).
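For reference, a minimal sketch of those two files, assuming the union schema from the question (the "BigQuery Schema" wrapper key and the UDF signature follow the template documentation, so verify them against the doc before use):

schema.json:

{
  "BigQuery Schema": [
    {"name": "name", "type": "STRING"},
    {"name": "age", "type": "INTEGER"},
    {"name": "city", "type": "STRING"},
    {"name": "zip", "type": "STRING"},
    {"name": "phone", "type": "STRING"}
  ]
}

udf.js (a pass-through UDF; "transform" is just the function name you register when launching the template):

function transform(line) {
  // Each input line is already valid JSON; pass it through and let
  // BigQuery fill the fields missing from a record with NULL.
  return line;
}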

You can follow this doc to launch a template job.
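For example, a template launch with gcloud could look like this (job name, region, and bucket paths are placeholders; parameter names are taken from the template documentation):

$ gcloud dataflow jobs run json-files-to-bq \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --region us-central1 \
    --parameters \
javascriptTextTransformGcsPath=gs://path/to/udf.js,\
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://path/to/schema.json,\
inputFilePattern=gs://path/to/files/*.txt,\
outputTable=project:datasetId.tableId,\
bigQueryLoadingTemporaryDirectory=gs://path/to/tmp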


Update: based on the comment, it looks like you're interested in creating your own pipeline. This is how you'd read from GCS and write to BigQuery.

import json
import sys
import apache_beam as beam

# Union schema; records missing a field are written with NULL for it.
schema = "name:STRING,age:INTEGER,zip:STRING,phone:STRING,city:STRING"

with beam.Pipeline(argv=sys.argv) as p:
    _ = (p
        | "read messages" >> beam.io.ReadFromText('gs://path/to/files/*.txt')
        | "parse" >> beam.Map(json.loads)
        | "write to bigquery" >> beam.io.WriteToBigQuery('project:datasetId.tableId', schema=schema)
    )

Unspecified values are automatically saved as nulls.
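To run this on Dataflow as a batch job rather than locally on the DirectRunner, pass the standard pipeline options on the command line; a sketch, assuming the script above is saved as pipeline.py (the project, region, and bucket are placeholders):

$ python pipeline.py \
    --runner DataflowRunner \
    --project your-project-id \
    --region us-central1 \
    --temp_location gs://your-bucket/tmp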

This is how I saved the data.

$ gsutil cat gs://path/to/files/example.txt
{"name":"John", "age":31, "city":"New York"}

Upvotes: 1
