FBidu

Reputation: 1012

How to create a pipeline to just rename columns?

My use case is quite simple. A data provider puts CSV files on Google Storage every hour. Once a day, a Transfer job picks up those files and sends them to a BigQuery table. The problem is that I need to rename those fields.

The first thing that came to me was simply using a scheduled query like

SELECT bad_field AS good_field (...) FROM raw_table

and schedule that query to append the results to a nicer table. I'm not sure if that's a very clean solution because I'm just starting out with BigQuery.
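Roughly, I imagine something like this (untested, the dataset and table names are just placeholders), which I could then schedule:

 bq query \
   --use_legacy_sql=false \
   --destination_table=yourdataset.nice_table \
   --append_table \
   'SELECT bad_field AS good_field FROM yourdataset.raw_table'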

I have also read about DataFlow but I have no experience with that service.

Has anybody had this sort of scenario? How did you deal with it?

Upvotes: 0

Views: 291

Answers (1)

saifuddin778

Reputation: 7277

You can schedule a local cron job to do this for you and load your data into a BigQuery table with the fields named whatever you want. All you have to do is specify a schema against which your CSV files are parsed, and the column names will come from that schema.

Let's say your CSV file has 3 columns, and you want them to be named X, Y, Z. Then you can schedule this bq command to do the job for you:

 bq --location=US \
   load \
   --replace=false \
   --source_format=CSV \
   --skip_leading_rows=1 \
   yourdataset.tablename \
   gs://yourbucket/date/*.csv \
   --schema=X:STRING,Y:FLOAT,Z:STRING

Please note the following:

  • The --skip_leading_rows=1 parameter (which assumes the first row is a header and skips it).
  • The --replace=false parameter (which keeps the job from overwriting the target table and appends to it instead).
  • The --schema parameter, which specifies the schema as a comma-separated list of NAME:TYPE pairs.
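As a sketch of the scheduling part (the script path and the daily time below are just placeholders to adapt), you could save the command above to a shell script and add a crontab entry like this to run the load once a day:

 # crontab entry: run the load script every day at 02:00 (path and time are placeholders)
 0 2 * * * /path/to/bq_load.sh >> /var/log/bq_load.log 2>&1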

Hope it helps.

Upvotes: 2
