Reputation: 8010
I am working on exporting a large dataset from BigQuery to Google Cloud Storage in compressed format. In Google Cloud Storage I have a file size limitation (maximum 1GB per file). Therefore I am splitting and compressing the data while exporting. The sample code is as follows:
import logging

from google.cloud import bigquery
from google.cloud import storage

# project, dataset_id, table_id and bucket_name are defined elsewhere
bigquery_client = bigquery.Client(project=project)
storage_client = storage.Client(project=project)

gcs_destination_uri = 'gs://{}/{}'.format(bucket_name, 'wikipedia-*.csv.gz')
gcs_bucket = storage_client.get_bucket(bucket_name)

# Job config: compress the exported shards with GZIP
job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP


def bigquery_datalake_load():
    dataset_ref = bigquery_client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)
    table = bigquery_client.get_table(table_ref)  # API request
    row_count = table.num_rows

    extract_job = bigquery_client.extract_table(
        table_ref,
        gcs_destination_uri,
        location='US',
        job_config=job_config)  # API request
    logging.info('BigQuery extract started... Wait for the job to complete.')
    extract_job.result()  # Waits for the job to complete.

    print('Exported {}:{}.{} to {}'.format(
        project, dataset_id, table_id, gcs_destination_uri))
# [END bigquery_extract_table]
This code splits the large dataset and compresses it into .gz format, but it returns multiple compressed files whose sizes range between 40MB and 70MB.
I am trying to generate compressed files of 1GB each. Is there any way to get this done?
Upvotes: 7
Views: 8078
Reputation: 327
As other members commented, it is not possible. This is less about a Google-specific algorithm and more about whether a file's compressed size can be calculated before the actual compression is done, and the answer is no.
Because compression techniques rely on some sort of dictionary, the final size of any file is known only after all symbols have been translated, which means the compression has already been done. There is another post on SO in which this is discussed, and this article explains some fundamentals about compression.
Files could be "padded" to make them a uniform, larger size, but that would defeat the purpose of compression (which is to save space).
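As a quick, hedged illustration (not BigQuery-specific): two inputs of the same uncompressed size can compress to very different sizes, so the output size is only known once the compression has actually run.

import gzip
import os

repetitive = b'wikipedia ' * 100000      # ~1 MB of highly repetitive data
random_ish = os.urandom(1000000)         # ~1 MB of essentially incompressible data

print(len(gzip.compress(repetitive)))    # a few KB
print(len(gzip.compress(random_ish)))    # close to the original 1 MB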
Upvotes: 0
Reputation: 403
I think it is possible. You need to know the total size of your export; knowing this, you can split the results using multiple wildcard URIs. {1}
For example, if your export is 10GB, you can define 10 wildcard URIs and each will be approximately 1GB in size (a sketch is shown after the links below).
You have a similar question answered here: {2}
{1}: https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
{2}: Exporting data to GCS from BigQuery - Split file size control
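Here is a minimal sketch of that approach with the Python client, reusing the names from the question (project, dataset_id, table_id, bucket_name) and assuming a roughly 10GB export; extract_table accepts a list of destination URIs, and BigQuery partitions the output across the patterns, though the exact size of each shard is still decided by the service.

from google.cloud import bigquery

bigquery_client = bigquery.Client(project=project)

# One wildcard URI per desired ~1GB partition (10 patterns for a ~10GB export)
destination_uris = [
    'gs://{}/wikipedia-{}-*.csv.gz'.format(bucket_name, i)
    for i in range(10)
]

job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP

table_ref = bigquery_client.dataset(dataset_id, project=project).table(table_id)
extract_job = bigquery_client.extract_table(
    table_ref,
    destination_uris,   # a list of wildcard URIs is accepted here
    location='US',
    job_config=job_config)
extract_job.result()  # Waits for the job to complete.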
Upvotes: 0
Reputation: 1099
Unfortunately no. Google adjusts it by itself; you do not have an option to specify the size. I believe this is because of the size of the uncompressed data: each BQ worker produces one file, and it is impossible to produce one file from multiple workers.
Upvotes: 5