gsouf
gsouf

Reputation: 1133

Speed up row scanning for large datasets in Spanner

When issuing a simple query in spanner using primary index of the interleaving table for a split having millions of records then it takes long time to scan the table.

For instance SELECT COUNT(*) FROM foo WHERE foo_key="bar" where foo_key is the primary index of the interleaving table. It scans 3,000,000 rows and it takes up to 40 seconds to resolve (please note that this question does not restrict to simple COUNT query, but any query where table scan is the bottleneck).

I'm thinking of that case in BigQuery where it will use multiple processes that get merged to speed up requests.What sounds confusing in spanner is that in the execution plan we can see that spanner performs a table scan with a single execution, then a distributed union of the rows. That makes me think it could use multiple processes as well.

Are there any ways for the scan process to be spread across multiple executions to speed up the table scan?

Upvotes: 2

Views: 1654

Answers (2)

sllopis
sllopis

Reputation: 2368

There exists Schema design best practices that will definetely help make a difference in reducing the time your scan process takes. They are as follows:

  • Choosing a primary key to prevent hotspots.
  • Limiting the row size.
  • Designing interleaved tables to prevent hotspots.
  • Using descending order for timestamp-based keys.
  • Using an interleaved index on a column whose value monotonically increases or decreases.

Also, the following link will show you best practices for constructing SQL statements for Cloud Spanner to find efficient execution plans. Using secondary indexes to speed up queries or writing efficient queries for joins and range key lookup are among the few best practices listed on the previous documentation.

There is also the so-called sharded counters approach where a specific counter, such as likes for a given picture or insert you specific use case here, consists of N shards/rows. Not totally sure if applicable to your use case.

Hope it helps.

Upvotes: 1

RedPandaCurios
RedPandaCurios

Reputation: 2324

Cloud Spanner does not perform well as an analytics database - as you have seen - and so full table scans are not recommended.

Depending on your query, you may be able to limit to a key-range or multiple key-ranges to reduce the number of rows scanned, possibly in conjunction with an index...

I am not sure I understand your example query though:

SELECT COUNT(*) FROM foo WHERE foo_key="bar"

(where foo_key is the primary index.)

This should only read a single row -- as you are specifying the primary key - and return either 1 or 0 depending on if the key exists.

Upvotes: 2

Related Questions