Reputation: 439
I'm using the spark-bigquery-connector to read from BigQuery, which uses the BigQuery Storage API under the hood. My script (automatically) requests multiple partitions from the BigQuery Storage API, but I get this warning:
WARN com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Requested 2 partitions, but only received 1 from the BigQuery Storage API
The Spark job takes a very long time, and I suspect it's because it isn't reading through multiple partitions in parallel. How can I make sure the BigQuery Storage API gives me all the partitions I ask for? What's happening here, and why is it only giving me a single partition, no matter how many I request?
First I create a SparkSession:
SparkSession spark = SparkSession.builder()
    .appName("XXX")
    .getOrCreate();
This is the code causing the WARN:
Dataset<Row> data = spark.read()
    .format("bigquery")
    .option("table", "project.dataset.table")
    .load()
    .cache();
Upvotes: 4
Views: 2840
Reputation: 30448
The spark-bigquery-connector uses heuristics to decide how many partitions to request from the BigQuery Storage API. The partitions actually returned are the ones BigQuery itself decides to use, which may be fewer than what the heuristic predicted. This is a normal situation, so a warning is perhaps too severe for it (I've discussed this with the BigQuery team as well). For further context, read the description of the requestedStreams parameter here.
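If you want to influence how many streams are requested in the first place, the connector exposes a parallelism hint. A minimal sketch, assuming the maxParallelism option available in recent connector versions; note it is only an upper bound, and the Storage API may still return fewer streams:

Dataset<Row> data = spark.read()
    .format("bigquery")
    .option("table", "project.dataset.table")
    // assumption: maxParallelism is supported by your connector version;
    // it caps the number of requested streams, BigQuery may return fewer
    .option("maxParallelism", "20")
    .load();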
The second issue is that the Spark job takes very long. If increasing the resources, especially the number of executors, doesn't help, please open a bug in the spark-bigquery-connector project with the actual stream ID and the rest of the Spark configuration, so that the connector and BigQuery teams can investigate it.
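As a starting point for increasing executors, a sketch using standard Spark properties (the values below are placeholders to tune for your cluster):

SparkSession spark = SparkSession.builder()
    .appName("XXX")
    // standard Spark properties; the numbers are illustrative only
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "4")
    .getOrCreate();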
Upvotes: 1