Reputation: 677
In short, we are sometimes seeing that a small number of Cloud Bigtable queries fail repeatedly (for 10s or even 100s of times in a row) with the error rpc error: code = 13 desc = "server closed the stream without sending trailers"
until (usually) the query finally works.
In detail, our setup is as follows:
We are running a collection (< 10) of Go services on Google Compute Engine. Each service leases tasks from a pair of PULL task queues. Each task contains an ID of a bigtable row. The task handler executes the following query:
row, err := tbl.ReadRow(ctx, <my-row-id>,
bigtable.RowFilter(bigtable.ChainFilters(
bigtable.FamilyFilter(<my-column-family>),
bigtable.LatestNFilter(1))))
If the query fails then the task handler simply returns. Since we lease tasks with a lease time between 10 and 15 minutes, a little while later the lease will expire on that task, it will be lease again, and we'll retry. The tasks have a max retry of 1000 so they can be retried many times over a long period. In a small number of cases, a particular task will fail with the grpc error above. The task will typically fail with this same error every time it runs for hours or days on end, before (seemingly out of the blue) eventually succeeding (or the task runs out of retries and dies).
Since this often takes so long, it seems unrelated to server load. For example right now on a Sunday morning, these servers are very lightly loaded, and yet I see plenty of these errors when I tail the logs. From this answer, I had originally thought that this might be due to trying to query for a large amount of data, perhaps near the max limit that cloud bigtable will support. However I now see that this is not the case; I can find many examples where tasks that have failed many times finally succeed and report only a small amount of data (e.g. <1 MB) was retrieved.
What else should I be looking at here?
edit: From further testing I now know that this is completely machine (client) independent. If I tail the log on one of the task leasing machines, wait for a "server closed the stream without sending trailers" error, and then try a one-off ReadRow
query to the same rowId from another, unrelated, totally unused machine, I get the same error repeatedly.
Upvotes: 3
Views: 1128
Reputation: 7132
This error is typically caused by having more than 256MB of data in your reply.
However, there is currently a bug in our server side error handling code that allows some invalid characters in HTTP/2 trailers which is not allowed by the spec. This means that some error messages that have invalid characters will be seen as this kind of error. This should be fixed early next year.
Upvotes: 2