Reputation: 15
I have the following table (using CQL3):
create table test (
shard text,
tuuid timeuuid,
some_data text,
status text,
primary key (shard, tuuid, some_data, status)
);
I would like to get rows ordered by tuuid. But this is only possible when I restrict shard - I get this is due to performance.
I have shard purely for sharding, and I can potentially restrict its range of values to some small range [0-16) say. Then, I could run a query like this:
select * from test where shard in (0,...,15) order by tuuid limit L;
I may have millions of rows in the table, so I would like to understand the performance characteristics of such a order by query. It would seem like the performance could be pretty bad in general, BUT with a limit clause of some reasonable number (order of 10K), this may not be so bad - i.e. a 16 way merge but with a fairly low limit.
Any tips, advice or pointers into the code on where to look would be appreciated.
Upvotes: 1
Views: 1679
Reputation: 1462
Your data is sorted according to your column key. So the performance issue in your merge in your query above does not happen due to the WHERE clause but because of your LIMIT clause, afaik.
Your columns are inserted IN ORDER according to tuuid so there is no performance issue there.
If you are fetching too many rows at once, I recommended creating a test_meta table where you store the latest timeuuid every X-inserts, to get an upper bound on the rows your query will fetch. Then, you can change your query to:
select * from test where shard in (0,...,15) and tuuid > x and tuuid < y;
In short: make use of your column keys and get rid of the limit. Alternatively, in Cassandra 2.0, there will be pagination which will help here, too.
Another issue I stumbled over, you say that
I may have millions of rows in the table
But according to your data model, you will have exactly shard number of rows. This is your row key and - together with the partitioner - will determine the distribution/sharding of your data.
hope that helps!
UPDATE
From my personal experience, cassandra performances quite well during heavy reads as well as writes. If the result sets became too large, I rather experienced memory issues on the receiving/client side rather then timeouts on the server side. Still, to prevent either, I recommend having a look a the upcoming (2.0) pagination feature.
In the meanwhile:
Try to investigate using the trace functionality in 1.2.
If you are mostly reading the "latest" data, try adding a reversed type.
For general optimizations like caches etc, first, read how cassandra handles reads on a node and then, see this tuning guide.
Upvotes: 1