Reputation: 12264
I have followed the "Use the BigQuery connector with Spark" guide to extract data from a table in BigQuery by running the code on Google Dataproc. As far as I'm aware, the code shared there:
    conf = {
        # Input Parameters.
        'mapred.bq.project.id': project,
        'mapred.bq.gcs.bucket': bucket,
        'mapred.bq.temp.gcs.path': input_directory,
        'mapred.bq.input.project.id': 'publicdata',
        'mapred.bq.input.dataset.id': 'samples',
        'mapred.bq.input.table.id': 'shakespeare',
    }

    # Output Parameters.
    output_dataset = 'wordcount_dataset'
    output_table = 'wordcount_output'

    # Load data in from BigQuery.
    table_data = sc.newAPIHadoopRDD(
        'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
        'org.apache.hadoop.io.LongWritable',
        'com.google.gson.JsonObject',
        conf=conf)
copies the entirety of the named table into input_directory. The table I need to extract data from contains more than 500 million rows, and I don't need all of them. Is there a way to issue a query instead (as opposed to specifying a table) so that I can copy only a subset of the data from the table?
Upvotes: 2
Views: 129
Reputation: 4457
It doesn't look like BigQuery supports any kind of filtering or querying for table exports at the moment: https://cloud.google.com/bigquery/docs/exporting-data https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract
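One possible workaround, which I have not tested against your data: since the export step copies whole tables, you could first save the result of your query into a separate, smaller table (for example with `bq query --destination_table=...`) and then point the connector at that table instead. The sketch below only builds the connector configuration; the helper `build_bq_input_conf` and every project, bucket, dataset and table name in it are hypothetical placeholders, not anything from the guide.

```python
def build_bq_input_conf(project, bucket, temp_path,
                        input_project, input_dataset, input_table):
    """Build the conf dict that tells the BigQuery connector which table to read."""
    return {
        'mapred.bq.project.id': project,
        'mapred.bq.gcs.bucket': bucket,
        'mapred.bq.temp.gcs.path': temp_path,
        'mapred.bq.input.project.id': input_project,
        'mapred.bq.input.dataset.id': input_dataset,
        'mapred.bq.input.table.id': input_table,
    }


# Hypothetical names: 'filtered_subset' stands for a table that already holds
# the saved query results, so only those rows get exported.
conf = build_bq_input_conf(
    project='my-project',
    bucket='my-bucket',
    temp_path='gs://my-bucket/hadoop/tmp/bigquery/pyspark_input',
    input_project='my-project',
    input_dataset='scratch_dataset',
    input_table='filtered_subset',
)
```

On Dataproc you would then pass this `conf` to `sc.newAPIHadoopRDD(...)` exactly as in the question. The export still copies a whole table, just a much smaller one, and you would need to delete the intermediate table yourself afterwards.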
Upvotes: 1