Mani Shankar.S

Reputation: 39

Split the output of a BigQuery table into batches and check for spill-over data in the output - GCP

I have a requirement where I need to export a BigQuery table into files of less than 28 MB each, check for data spilling over between the output files, and realign it accordingly.

For example: if the output of a BQ table is a 100 MB CSV file, I need to split it into chunks of 28 MB each (the last file may be smaller). I also need to ensure that data does not spill over between the files.

Let's say the last line of File1 has USA data and the first line of the second file also has USA data; I then need to move the USA data from the first file to the second file.

The data looks like:

USA|DEALER1|PART1|10|30
...
..
..
AFG|DEALER1|PART1|10|80
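For illustration, here is a minimal Python sketch of the grouping logic I have in mind (the pipe delimiter and country-first layout are taken from the sample above; file paths and the 28 MB limit are placeholders, and the input is assumed to be already sorted by country):

```python
MAX_BYTES = 28 * 1000 * 1000  # target size per output file (assumption: 28 MB)

def split_by_country(in_path, out_prefix, max_bytes=MAX_BYTES):
    """Split a country-sorted, pipe-delimited file into chunks of at most
    max_bytes, never splitting one country's rows across two files."""
    out_paths = []
    buf = []            # lines of the country group currently being read
    cur_country = None
    out = None
    out_size = 0
    part = 0

    def open_next():
        nonlocal out, out_size, part
        if out:
            out.close()
        part += 1
        path = f"{out_prefix}{part}.csv"
        out_paths.append(path)
        out = open(path, "w")
        out_size = 0

    def flush_group():
        # Write the buffered country group; start a new file if appending
        # it would push the current file past the limit.
        nonlocal out_size
        group_bytes = sum(len(l) for l in buf)
        if out is None or (out_size and out_size + group_bytes > max_bytes):
            open_next()
        out.writelines(buf)
        out_size += group_bytes
        buf.clear()

    with open(in_path) as f:
        for line in f:
            country = line.split("|", 1)[0]
            if country != cur_country and buf:
                flush_group()
            cur_country = country
            buf.append(line)
        if buf:
            flush_group()
    if out:
        out.close()
    return out_paths
```

Because whole country groups are written atomically, no spill-over check is needed afterwards; the trade-off is that a single country larger than the limit still ends up in one oversized file.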

Steps I have tried:

I extracted the data from the BQ table ordered by country, which gave me one single 100 MB file.

gsutil cat -r 0-28000000 gs://<bucket>/<object>/000000000000.csv | gsutil cp - gs://<bucket>/<object>/outp1.txt

This split the file at 28 MB, but the records are incomplete, meaning there is a PARTIAL record on the last line of the file [AFG|DEALER1|Pa.... ], which leads to data loss. I also have to provide the byte range manually, which is another drawback.
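One workaround I am considering for the partial-record problem is GNU coreutils `split -C`, which packs at most N bytes of *complete lines* per chunk (an assumption on my side: GNU `split` is available where this runs; the demo input and the 48-byte limit below are placeholders, the real limit would be 28000000):

```shell
# Demo input standing in for the exported BigQuery file (placeholder data).
printf 'USA|DEALER1|PART1|10|30\nUSA|DEALER1|PART2|10|40\nAFG|DEALER1|PART1|10|80\n' > full.csv

# -C: at most N bytes of whole lines per chunk, so no record is ever cut.
# 48 bytes here just to force multiple chunks; use -C 28000000 for real data.
split -C 48 -d --additional-suffix=.csv full.csv outp

# In practice the file would be copied down first and the chunks copied back:
#   gsutil cp gs://<bucket>/<object>/000000000000.csv ./full.csv
#   gsutil cp outp*.csv gs://<bucket>/<object>/
```

This removes the manual byte-range step, but it still splits purely by size, so a country group can end up straddling two chunks.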

I am also not sure how to do the record spill-over check.
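The check I have in mind would run after splitting: for each adjacent pair of chunk files, compare the country of the last line of one with the first line of the next, and if they match, move the trailing country group forward, as in the USA example above. A sketch (file names are placeholders; chunks are assumed to be sorted by country):

```python
def country(line):
    # First pipe-delimited field is the country code.
    return line.split("|", 1)[0]

def fix_spillover(paths):
    """For each adjacent pair of chunk files, if the last country of one
    file matches the first country of the next, move that trailing
    country group into the next file."""
    for a, b in zip(paths, paths[1:]):
        with open(a) as f:
            a_lines = f.readlines()
        with open(b) as f:
            b_lines = f.readlines()
        if not a_lines or not b_lines:
            continue
        if country(a_lines[-1]) == country(b_lines[0]):
            spill = country(a_lines[-1])
            # Walk backwards to find where the spilled group starts.
            i = len(a_lines)
            while i > 0 and country(a_lines[i - 1]) == spill:
                i -= 1
            moved, a_lines = a_lines[i:], a_lines[:i]
            with open(a, "w") as f:
                f.writelines(a_lines)
            with open(b, "w") as f:
                f.writelines(moved + b_lines)
```

One caveat: moving rows forward can push the receiving file back over the 28 MB limit, so the sizes would need re-checking after this pass.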

Can someone share an elegant way to achieve this?

Upvotes: 0

Views: 54

Answers (0)
