Reputation: 1872
Running a Spark job in Google Cloud Dataproc. Using the BigQuery Connector to load json data output from the job into a BigQuery table.
BigQuery Standard-SQL data types documentation states the ARRAY type is supported.
My Scala code is:
val outputDatasetId = "mydataset"
val tableSchema = "["+
"{'name': '_id', 'type': 'STRING'},"+
"{'name': 'array1', 'type': 'ARRAY'},"+
"{'name': 'array2', 'type': 'ARRAY'},"+
"{'name': 'number1', 'type': 'FLOAT'}"+
"]"
// Output configuration
BigQueryConfiguration.configureBigQueryOutput(
conf, projectId, outputDatasetId, "outputTable",
tableSchema)
//Write visits to BigQuery
jsonData.saveAsNewAPIHadoopDataset(conf)
But the job throws this exception:
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "Invalid value for: ARRAY is not a valid value",
"reason" : "invalid"
} ],
"message" : "Invalid value for: ARRAY is not a valid value"
}
at
com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAnThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:432)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:287)
at com.google.cloud.hadoop.io.bigquery.BigQueryRecordWriter.close(BigQueryRecordWriter.java:358)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$5.apply$mcV$sp(PairRDDFunctions.scala:1124)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1366)
... 8 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException:
400 Bad Request
Is this a Legacy vs. Standard SQL issue? Or is ARRAY type not supported by the BigQuery Connector for Spark?
Upvotes: 3
Views: 4499
Reputation: 33765
Are these arrays of strings? Integers? I believe that using this API, you need to set the type
to the element type, e.g. STRING
or INT64
, but use a mode
of REPEATED
. The BigQuery APIs have not yet been fully updated to use standard SQL types everywhere, so you need to use the legacy convention of type + mode instead.
Upvotes: 2
Reputation: 11797
Instead of using type=ARRAY
, try setting the type
as you normally would but also set the key mode=REPEATED
.
An array of strings for instance would be defined as:
{'name': 'field1', 'type': 'STRING', 'mode': 'REPEATED'}
Upvotes: 12