manesioz

Reputation: 837

Compose Google Storage Objects without headers via CLI

I was wondering if it is possible to compose Google Cloud Storage objects (specifically CSV files) without headers (i.e. without the row containing the column names) using gsutil.

Currently, I can do the following:

gsutil compose gs://bucket/test_file_1.csv gs://bucket/test_file_2.csv gs://bucket/test-composition-files.csv

However, I will be unable to ingest test-composition-files.csv into Google BigQuery, because compose blindly concatenates the files (including each file's header row).

One possible solution would be to download the files locally and process them with pandas, but this is not ideal for large files.
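(For reference, a minimal sketch of that pandas route, assuming the gcsfs package is installed so pandas can read gs:// paths directly; it pulls every file into memory, which is exactly the problem for large files.)

import pandas as pd

# Each read_csv consumes its own header row; requires gcsfs for gs:// paths.
frames = [
    pd.read_csv("gs://bucket/test_file_1.csv"),
    pd.read_csv("gs://bucket/test_file_2.csv"),
]

# The concatenated frame is written back with a single header row.
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("gs://bucket/test-composition-files.csv", index=False)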

Is there any way to do this via the CLI? I could not find anything in the docs.

Upvotes: 3

Views: 1023

Answers (1)

guillaume blaquiere

Reputation: 75775

Reading the comments, I think you are spending effort in the wrong way. I understand that you want to load your files into BigQuery, but the large number of files prevents you from doing this directly (too many API calls), and Dataflow is too slow.

Maybe you can think about it differently. I have 2 solutions to propose:

  • If you need "near real time" ingestion, and the file size is below 1.5 GB, the best way is to build a function which reads the file and performs a streaming write to BigQuery. The function is triggered by a Cloud Storage event; if several files arrive at the same time, several function instances will be spawned. Be careful: streaming writes to BigQuery are not free (see the first sketch after this list).
  • If you can wait up to 2 minutes after a file arrives, I recommend building a Cloud Function triggered every 2 minutes. This function reads the file names in a bucket, moves the files to a sub directory, and performs one load job over all the files in that sub directory (see the second sketch after this list). You are limited to 1,000 load jobs per day and per table, and a day contains 1,440 minutes, so batching every 2 minutes keeps you well under the quota. Load jobs are free.
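
A minimal sketch of the first option, assuming a Python Cloud Function with a google.storage.object.finalize trigger; the table ID is a placeholder:

import csv
import io

from google.cloud import bigquery, storage

bq_client = bigquery.Client()
gcs_client = storage.Client()

TABLE_ID = "my-project.my_dataset.my_table"  # hypothetical target table

def stream_csv_on_upload(event, context):
    # Triggered by a finalize event on the bucket.
    bucket = gcs_client.bucket(event["bucket"])
    text = bucket.blob(event["name"]).download_as_text()

    # DictReader consumes the header row itself, so the column names
    # are never streamed into the table.
    rows = list(csv.DictReader(io.StringIO(text)))

    # Streaming insert: near real time, but billed per MB inserted.
    errors = bq_client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")

And a sketch of the second option, assuming a Cloud Function invoked every 2 minutes (e.g. by Cloud Scheduler via Pub/Sub); the bucket, prefix, and table names are placeholders. Note skip_leading_rows=1, which makes the load job discard each file's header row and so also sidesteps the original compose problem:

from google.cloud import bigquery, storage

bq_client = bigquery.Client()
gcs_client = storage.Client()

BUCKET = "bucket"                            # hypothetical bucket
BATCH_PREFIX = "batch/"                      # staging sub directory
TABLE_ID = "my-project.my_dataset.my_table"  # hypothetical target table

def batch_load(event, context):
    bucket = gcs_client.bucket(BUCKET)

    # Move newly arrived CSVs into the staging sub directory.
    staged = False
    for blob in list(gcs_client.list_blobs(BUCKET)):
        if blob.name.endswith(".csv") and not blob.name.startswith(BATCH_PREFIX):
            bucket.rename_blob(blob, BATCH_PREFIX + blob.name)
            staged = True

    if not staged:
        return

    # One free load job for everything staged; skip_leading_rows=1 drops
    # each file's header row.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    )
    bq_client.load_table_from_uri(
        f"gs://{BUCKET}/{BATCH_PREFIX}*.csv", TABLE_ID, job_config=job_config
    ).result()  # wait for the job to finish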

Are these acceptable alternatives?

Upvotes: 1
