Reputation: 1588
When I query secondary index with pagination, query becomes slower when data grows.
I thought with pagination, no matter how large your data grow, it takes same time to query one page. Is that true? Why my query get slower?
My simplified table is
CREATE TABLE closed_executions (
domain_id uuid,
workflow_id text,
start_time timestamp,
workflow_type_name text,
PRIMARY KEY ((domain_id), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC)
AND COMPACTION = {
'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
}
AND GC_GRACE_SECONDS = 172800;
And I create a secondary index as
CREATE INDEX closed_by_type ON closed_executions (workflow_type_name);
I query with following CQL
SELECT workflow_id, start_time, workflow_type_name
FROM closed_executions
WHERE domain_id = ?
AND start_time >= ?
AND start_time <= ?
AND workflow_type_name = ?
and code
query := v.session.Query(templateGetClosedWorkflowExecutionsByType,
request.DomainUUID,
common.UnixNanoToCQLTimestamp(request.EarliestStartTime),
common.UnixNanoToCQLTimestamp(request.LatestStartTime),
request.WorkflowTypeName).Consistency(gocql.One)
iter := query.PageSize(request.PageSize).PageState(request.NextPageToken).Iter()
// PageSize is 10, but could be thousand
Environement:
Observation:
10K rows, within second
100K rows, ~3 second
1M rows, ~17 second
Debug log:
INFO [ScheduledTasks:1] 2018-09-11 16:29:48,349 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2018-09-11 16:29:48,357 MonitoringTask.java:173 - 1 operations were slow in the last 5005 msecs:
<SELECT * FROM cadence_visibility.closed_executions WHERE workflow_type_name = code.uber.internal/devexp/cadence-bench/load/basic.stressWorkflowExecute AND token(domain_id, domain_partition) >= token(d3138e78-abe7-48a0-adb9-8c466a9bb3fa, 0) AND token(domain_id, domain_partition) <= token(d3138e78-abe7-48a0-adb9-8c466a9bb3fa, 0) AND start_time >= 2018-09-11 16:29-0700 AND start_time <= 1969-12-31 16:00-0800 LIMIT 10>, time 2747 msec - slow timeout 500 msec
DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-11 16:31:47,774 AbstractCommitLogSegmentManager.java:107 - No segments in reserve; creating a fresh one
DEBUG [ScheduledTasks:1] 2018-09-11 16:40:22,922 ColumnFamilyStore.java:899 - Enqueuing flush of size_estimates: 23.997MiB (2%) on-heap, 0.000KiB (0%) off-heap
Related ref (no answer for my questions):
-- Edit tablestats returns
Total number of tables: 105
----------------
Keyspace : cadence_visibility
Read Count: 19
Read Latency: 0.5125263157894736 ms.
Write Count: 3220964
Write Latency: 0.04900822269357869 ms.
Pending Flushes: 0
Table: closed_executions
SSTable count: 1
SSTables in each level: [1, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 20.3 MiB
Space used (total): 20.3 MiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 6.35 KiB
SSTable Compression Ratio: 0.40192660515179696
Number of keys (estimate): 3
Memtable cell count: 28667
Memtable data size: 7.35 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 9
Local read count: 9
Local read latency: NaN ms
Local write count: 327024
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 16 bytes
Bloom filter off heap memory used: 8 bytes
Index summary off heap memory used: 38 bytes
Compression metadata off heap memory used: 6.3 KiB
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 62479625
Compacted partition mean bytes: 31239902
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0 bytes
----------------
Upvotes: 1
Views: 728
Reputation: 1588
Why pagination doesn't scale as the main table?
Your data in your secondary index is disperse
pagination will only apply logic
until it hits the page number
since your data is not clustered by time
you still have to sift through lots and lots of rows
before you can find your first 10 for example .
Query Tracing do show pagination plays at the very late phase.
Why secondary index is slow?
First Cassandra reads the index table to retrieve the primary key of all matching rows and for each of them, it will read the original table to fetch out the data. It is known anti-patterns with low cardinality index. (reference https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive)
Upvotes: 0