Reputation: 349
I was profiling an application that uses Cassandra, and it turned out that reads were the bottleneck. On closer inspection they seem to take way too long; I would really appreciate some help in understanding why.
The application always reads the whole set of rows for a given partition key (the query is of the form SELECT * FROM table WHERE partition_key = ?). Unsurprisingly, the read time is O(number of rows for the partition key); however, the constant seems way too high. After examining the query trace, it turns out that the majority of the time is spent on "merging data from memtables and sstables".
This step takes over 200ms for a partition key with ~5000 rows, where each row consists of 9 columns and is less than 100 bytes. Given the read throughput of an SSD, reading 0.5MB sequentially should be near-instantaneous.
Actually, I doubt this has anything to do with I/O at all. The machine used to have a spinning disk, which was replaced with the SSD it has now; the change had no impact on query performance. I think there is something very involved in Cassandra's processing, or in how it reads the data off disk, that makes this operation expensive.
Merging from more than one SSTable or iterating over tombstoned cells does not explain this. First of all, it should take milliseconds; second, this is happening consistently, regardless of whether 2 or 4 SSTables are involved and whether or not there are tombstoned cells.
To give some background:
Hardware: The machine running Cassandra is an 8-core, bare-metal, SSD-backed box. I query it from cqlsh on the same machine, and the data is stored locally. There is no other load on it, and looking at iostat, there is barely any I/O.
Data model: The partition key, PK, is of type text; the primary key is a composite of the partition key and a bigint column K; the rest are 7 mutable columns. The schema creation command is listed below.
CREATE TABLE inboxes (
PK text,
K bigint,
A boolean,
B boolean,
C boolean,
D boolean,
E bigint,
F bigint,
G int,
PRIMARY KEY (PK, K)
) WITH CLUSTERING ORDER BY (K DESC);
This is an example trace, with 3 SSTables involved and a fairly large number of tombstones.
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------+--------------+-------------+----------------
execute_cql3_query | 03:14:07,507 | 10.161.4.77 | 0
Parsing select * from table where PK = 'key_value' LIMIT 10000; | 03:14:07,508 | 10.161.4.77 | 123
Preparing statement | 03:14:07,508 | 10.161.4.77 | 244
Executing single-partition query on table | 03:14:07,509 | 10.161.4.77 | 1155
Acquiring sstable references | 03:14:07,509 | 10.161.4.77 | 1173
Merging memtable tombstones | 03:14:07,509 | 10.161.4.77 | 1195
Key cache hit for sstable 2906 | 03:14:07,509 | 10.161.4.77 | 1231
Seeking to partition beginning in data file | 03:14:07,509 | 10.161.4.77 | 1240
Key cache hit for sstable 1533 | 03:14:07,509 | 10.161.4.77 | 1550
Seeking to partition beginning in data file | 03:14:07,509 | 10.161.4.77 | 1561
Key cache hit for sstable 1316 | 03:14:07,509 | 10.161.4.77 | 1867
Seeking to partition beginning in data file | 03:14:07,509 | 10.161.4.77 | 1878
Merging data from memtables and 3 sstables | 03:14:07,510 | 10.161.4.77 | 2180
Read 5141 live and 1944 tombstoned cells | 03:14:07,646 | 10.161.4.77 | 138734
Request complete | 03:14:07,742 | 10.161.4.77 | 235030

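As a sanity check, the per-cell cost implied by this trace can be worked out with a quick back-of-the-envelope calculation (a rough sketch using only the source_elapsed figures above):

```python
# Back-of-the-envelope check using the source_elapsed values from the trace
# above (all times in microseconds).
merge_start_us = 2180    # "Merging data from memtables and 3 sstables"
merge_end_us = 138734    # "Read 5141 live and 1944 tombstoned cells"
live_cells = 5141
tombstoned_cells = 1944

merge_us = merge_end_us - merge_start_us              # ~136.6 ms spent merging
per_cell_us = merge_us / (live_cells + tombstoned_cells)

print(f"merge step:    {merge_us / 1000:.1f} ms")     # 136.6 ms
print(f"cost per cell: {per_cell_us:.1f} us")         # ~19.3 us
```

So the merge step alone costs on the order of 20 microseconds per cell touched (live and tombstoned alike), which is what makes the total scale so steeply with partition size.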
Upvotes: 1
Views: 1822
Reputation: 19377
You're not just "reading 0.5MB sequentially": you're asking Cassandra to deserialize it into rows, filter out tombstones (deleted data), and assemble a result set. 0.04ms per row is pretty reasonable; my rule of thumb is 0.5ms per 10 rows for an entire query.
Remember that Cassandra is optimized for short requests suitable for online applications; result sets of 10 to 100 rows are typical. There is no parallelization within a single query.
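Applying that rule of thumb to the trace in the question (a rough estimate, not a measurement):

```python
# Rule of thumb: ~0.5 ms per 10 rows for a whole query, i.e. 0.05 ms per row.
rows = 5141                        # live cells read in the question's trace
estimated_ms = rows * 0.5 / 10     # ~257 ms
observed_ms = 235030 / 1000        # "Request complete" at source_elapsed 235030 us

print(f"estimated: {estimated_ms:.0f} ms, observed: {observed_ms:.0f} ms")
```

The observed total is in the same ballpark as the rule of thumb, which is the point: this latency is the expected per-row processing cost for a 5000-row partition read, not an I/O problem.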
Upvotes: 2