Reputation: 1751
I am using Spark to load some data into BigQuery. The idea is to read data from S3 and use Spark together with the BigQuery client API to load it. Below is the code that does the insert into BigQuery.
val bq = createAuthorizedClientWithDefaultCredentialsFromStream(appName, credentialStream)
val bqjob = bq.jobs().insert(pid, job, data).execute() // data is an InputStreamContent
With this approach, I am seeing a lot of SocketTimeoutExceptions.
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:911)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:419)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
It looks like the delay in reading from S3 causes the Google HTTP client to time out. I wanted to increase the timeout and tried the options below.
val req = bq.jobs().insert(pid, job, data).buildHttpRequest()
req.setReadTimeout(3 * 60 * 1000)
val res = req.execute()
But this causes a precondition failure in the BigQuery client. It expects the mediaUploader to be null, though I am not sure why.
Exception in thread "main" java.lang.IllegalArgumentException
at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
at com.google.api.client.util.Preconditions.checkArgument(Preconditions.java:37)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:297)
This led me to try the second insert API on BigQuery:
val req = bq.jobs().insert(pid, job).buildHttpRequest().setReadTimeout(3 * 60 * 1000).setContent(data)
val res = req.execute()
And this time it failed with a different error.
Exception in thread "main" com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: ",
    "reason" : "invalid"
  } ],
  "message" : "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: "
}
Please suggest how I can set the timeout, and point out if I am doing something wrong.
Upvotes: 0
Views: 2869
Reputation: 2057
I'll answer the main question from the title: how to set timeouts using the Java client library.
To set timeouts, you need a custom HttpRequestInitializer configured in your client. For example:
Bigquery.Builder builder =
    new Bigquery.Builder(new UrlFetchTransport(), new JacksonFactory(), credential);
final HttpRequestInitializer existing = builder.getHttpRequestInitializer();
builder.setHttpRequestInitializer(new HttpRequestInitializer() {
  @Override
  public void initialize(HttpRequest request) throws IOException {
    // Run the default initializer first (here, the credential) so auth is still applied.
    existing.initialize(request);
    // Then override the timeouts; both values are in milliseconds.
    request
        .setReadTimeout(READ_TIMEOUT)
        .setConnectTimeout(CONNECTION_TIMEOUT);
  }
});
Bigquery client = builder.build();
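With the client built this way, every request it creates goes through that initializer, so the original media-upload call should pick up the longer timeouts. As a rough sketch, assuming pid, job and data are the same values as in your snippet:
Job bqjob = client.jobs().insert(pid, job, data).execute(); // read/connect timeouts now come from the initializer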
I don't think this will solve all the issues you are facing. Here are a few ideas that might be helpful, but I don't fully understand the scenario, so these may be off track:
bigquery.tabledata.insertAll may be more appropriate for large fan-in scenarios like this. See https://cloud.google.com/bigquery/streaming-data-into-bigquery for more details.
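For reference, here is a rough sketch of what a streaming insert looks like with the same Java client. The dataset/table names and the row contents are placeholders, pid and client are the project id and Bigquery client from above, and the request/response types live in com.google.api.services.bigquery.model:
// Each row is a map whose keys must match the destination table's schema.
TableDataInsertAllRequest.Rows row = new TableDataInsertAllRequest.Rows()
    .setJson(Collections.<String, Object>singletonMap("some_column", "some_value"));

TableDataInsertAllRequest content = new TableDataInsertAllRequest()
    .setRows(Collections.singletonList(row));

// "my_dataset" and "my_table" are placeholders for the real destination.
TableDataInsertAllResponse response = client.tabledata()
    .insertAll(pid, "my_dataset", "my_table", content)
    .execute();

// Streaming inserts report per-row failures in the response rather than throwing.
if (response.getInsertErrors() != null && !response.getInsertErrors().isEmpty()) {
  // handle failed rows here
}
Thanks for the question!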
Upvotes: 1