Reputation: 43
I have a set of .txt files with data in JSON format with varying schemas, all of which need to be loaded into a common BigQuery table. The file format is simple: one set of files has 4 columns, some have 5, and some have the same 5 but in a different order...
Eg: File in Folder 1 '{"name":"John", "age":31, "city":"New York"}'
File in Folder 2 '{"name":"Tom", "age":32, "city":"New York", "zip":"12345"}'
File in Folder 3 '{"name":"Janice", "age":31, "zip":"12345", "city":"New York"}'
File in Folder 4 '{"name":"Alice", "age":34, "zip":"12348"}'
File in Folder 5 '{"name":"Rob", "age":31, "zip":"12345", "Phone":"1234567891"}'
The BQ table will have the columns name, age, city, zip, and phone.
Each type of file is in a separate folder in GCS, and the files in each folder are consistent: within a GCS bucket subfolder, all files have the same type, number, and order of columns.
I am able to load these files individually into corresponding BQ tables, but I need them in a single table, loading NULL for each 'not applicable' column.
I'm seeking advice on loading the data into BQ using Dataflow as a batch process. I'm pretty new to this; I appreciate your help.
Upvotes: 0
Views: 827
Reputation: 1977
Cloud Storage Text to BigQuery, a Google-provided template, can be used for your use case.
As described in the documentation, you need to create a BigQuery schema file that is the union of all known schemas, plus a JavaScript UDF to handle any missing fields (BigQuery may be able to automatically replace missing values with NULLs, but I haven't tested that).
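For example, the schema file could look like this (the "BigQuery Schema" wrapper key is what the Cloud Storage Text to BigQuery template expects; the column names are taken from your question):

```json
{
  "BigQuery Schema": [
    {"name": "name",  "type": "STRING"},
    {"name": "age",   "type": "INTEGER"},
    {"name": "city",  "type": "STRING"},
    {"name": "zip",   "type": "STRING"},
    {"name": "phone", "type": "STRING"}
  ]
}
```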
You can follow this doc to launch a template job.
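A template job can be launched from the CLI along these lines (a sketch: the bucket paths, my-project, my_dataset, and the transform.js file name are placeholders you'd replace with your own):

```
gcloud dataflow jobs run json-to-bq \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --region us-central1 \
    --parameters \
inputFilePattern=gs://my-bucket/folder*/*.txt,\
JSONPath=gs://my-bucket/schema.json,\
outputTable=my-project:my_dataset.people,\
javascriptTextTransformGcsPath=gs://my-bucket/transform.js,\
javascriptTextTransformFunctionName=transform,\
bigQueryLoadingTemporaryDirectory=gs://my-bucket/tmp
```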
Update: based on the comment, it looks like you're interested in creating your own pipeline. This is how you'd read from GCS and write to BigQuery:
import json

import apache_beam as beam

schema = "name:STRING,age:INTEGER,zip:STRING,phone:STRING,city:STRING"

with beam.Pipeline() as p:
    _ = (p
         | "read messages" >> beam.io.ReadFromText('gs://path/to/files/*.txt')
         | "parse" >> beam.Map(json.loads)
         | "write to bigquery" >> beam.io.WriteToBigQuery(
             'project:datasetId.tableId', schema=schema))
Unspecified values are automatically saved as NULLs.
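If you'd rather make the NULL filling explicit instead of relying on the sink, the parse step can normalize every record to the full column set before writing (a sketch; the parse_record helper and the lowercase key mapping are my additions, to cover the "Phone" vs "phone" case in your example files):

```python
import json

COLUMNS = ["name", "age", "city", "zip", "phone"]

def parse_record(line):
    """Parse one JSON line and fill any missing columns with None.

    Keys are lowercased so e.g. "Phone" maps to the "phone" column.
    """
    raw = {k.lower(): v for k, v in json.loads(line).items()}
    return {col: raw.get(col) for col in COLUMNS}

# Example: a record from Folder 5, which has no "city" field.
parse_record('{"name":"Rob", "age":31, "zip":"12345", "Phone":"1234567891"}')
# -> {'name': 'Rob', 'age': 31, 'city': None, 'zip': '12345', 'phone': '1234567891'}
```

In the pipeline above you'd then use beam.Map(parse_record) in place of beam.Map(json.loads).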
This is how I saved the data:
$ gsutil cat gs://path/to/files/example.txt
{"name":"John", "age":31, "city":"New York"}
Upvotes: 1