Google Cloud Bigtable: repeated grpc error code 13, then suddenly success

Question

In short, a small number of our Cloud Bigtable queries sometimes fail repeatedly (tens or even hundreds of times in a row) with the error rpc error: code = 13 desc = "server closed the stream without sending trailers" until (usually) the query finally succeeds.

In detail, our setup is as follows:

We are running a collection (fewer than 10) of Go services on Google Compute Engine. Each service leases tasks from a pair of PULL task queues. Each task contains the ID of a Bigtable row. The task handler executes the following query:

row, err := tbl.ReadRow(ctx, , 
    bigtable.RowFilter(bigtable.ChainFilters(
        bigtable.FamilyFilter(),
        bigtable.LatestNFilter(1))))

If the query fails, the task handler simply returns. Since we lease tasks with a lease time between 10 and 15 minutes, the lease on that task expires a little while later, the task is leased again, and we retry. The tasks have a max retry count of 1000, so they can be retried many times over a long period. In a small number of cases, a particular task fails with the gRPC error above. The task typically fails with this same error every time it runs, for hours or days on end, before eventually succeeding, seemingly out of the blue (or the task runs out of retries and dies).
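
To put the retry budget in perspective, here is a rough back-of-the-envelope calculation in Go. It assumes, as a simplification not stated above, that every failed attempt consumes one full lease and that the task is leased again as soon as the lease expires. The function name retryWindowDays is made up for this sketch.

```go
package main

import "fmt"

// retryWindowDays estimates how long a task can keep failing before it runs
// out of retries. Assumption (not from the setup above): each attempt uses up
// one full lease and the task is re-leased immediately after the lease expires.
func retryWindowDays(maxRetries int, leaseMinutes float64) float64 {
	const minutesPerDay = 24 * 60
	return float64(maxRetries) * leaseMinutes / minutesPerDay
}

func main() {
	// Figures from the setup: 1000 retries, leases of 10 to 15 minutes.
	fmt.Printf("shortest window: %.1f days\n", retryWindowDays(1000, 10))
	fmt.Printf("longest window:  %.1f days\n", retryWindowDays(1000, 15))
}
```

Under those assumptions this prints a window of about 6.9 to 10.4 days, which is consistent with tasks failing for days before they either succeed or run out of retries.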

Since this often takes so long, it seems unrelated to server load. For example, right now, on a Sunday morning, these servers are very lightly loaded, and yet I see plenty of these errors when I tail the logs. From this answer, I had originally thought that this might be caused by querying for a large amount of data, perhaps near the maximum that Cloud Bigtable supports. However, I now see that this is not the case: I can find many examples where tasks that failed many times finally succeeded and reported that only a small amount of data (e.g. < 1 MB) was retrieved.

What else should I be looking at here?

Edit: From further testing, I now know that this is completely independent of the machine (client). If I tail the log on one of the task-leasing machines, wait for a "server closed the stream without sending trailers" error, and then try a one-off ReadRow query for the same rowId from another, unrelated, totally unused machine, I get the same error repeatedly.
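
The one-off check described above can be made repeatable with a small probe that reads the same row several times and tallies the outcomes. This is only a sketch: readFunc, summary, probe, flakyReader, demo, and the row key are hypothetical names. The fake reader stands in for a wrapper around the real tbl.ReadRow call so that the example runs without a Bigtable connection.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// readFunc is a hypothetical stand-in for a single-row read. In the real
// service it would wrap the tbl.ReadRow call and return only its error.
type readFunc func(ctx context.Context, rowKey string) error

// summary counts outcomes across repeated attempts on one row key.
type summary struct {
	successes int
	failures  map[string]int // error message -> number of occurrences
}

// probe reads the same row key the given number of times, gives each attempt
// its own timeout, and tallies successes and distinct error messages.
func probe(ctx context.Context, read readFunc, rowKey string, attempts int, timeout time.Duration) summary {
	s := summary{failures: make(map[string]int)}
	for i := 0; i < attempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, timeout)
		err := read(attemptCtx, rowKey)
		cancel()
		if err != nil {
			s.failures[err.Error()]++
			continue
		}
		s.successes++
	}
	return s
}

// flakyReader is a fake used only for this demonstration: it fails the first
// n calls with the error text from the question, then succeeds.
func flakyReader(n int) readFunc {
	calls := 0
	return func(ctx context.Context, rowKey string) error {
		calls++
		if calls <= n {
			return errors.New(`rpc error: code = 13 desc = "server closed the stream without sending trailers"`)
		}
		return nil
	}
}

// demo runs the probe against the fake reader with a made-up row key.
func demo() summary {
	return probe(context.Background(), flakyReader(3), "example-row", 5, time.Second)
}

func main() {
	s := demo()
	fmt.Printf("successes: %d\n", s.successes)
	for msg, n := range s.failures {
		fmt.Printf("%d x %s\n", n, msg)
	}
}
```

Running the same probe from two different machines against the same row key would show whether the failure pattern follows the row rather than the client, as the manual test suggests.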
