Can I issue a query rather than specify a table when using the BigQuery connector for Spark?

Question

I have used the Use the BigQuery connector with Spark to extract data from a table in BigQuery by running the code on Google Dataproc. As far as I'm aware the code shared there:

conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Output Parameters.
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

copies the entirety of the named table into input_directory. The table I need to extract data from contains >500m rows and I don't need all of those rows. Is there a way to instead issue a query (as opposed to specifying a table) so that I can copy a subset of the data from a table?

Igor Dvorzhak · Accepted Answer

Doesn't look like BigQuery supports any kind of filtering/querying for tables export at the moment: https://cloud.google.com/bigquery/docs/exporting-data https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract

Can I issue a query rather than specify a table when using the BigQuery connector for Spark?

Answers (1)

Related Questions