Reputation: 521
We have a single Cassandra node with this table:
CREATE TABLE test (
    pk bigint,
    b text,
    d int,
    PRIMARY KEY ((pk))
);
CREATE INDEX test_d_idx ON test (d);
This query runs fast in cqlsh:
select * from test where d = 20140407;
But the DataStax Cassandra driver for Spark generates this query:
SELECT "d", "b" FROM "test" WHERE token("pk") > 6755909504673608635 AND d = 20140407 ALLOW FILTERING
which is slow.
SELECT partitioner FROM system.local;
returns
> org.apache.cassandra.dht.Murmur3Partitioner
Question: how can I configure a single node so that a SELECT with token("pk") runs fast?
The table contains 10^7 rows (with only 1000 rows it is fast, obviously).
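For context, the Spark connector issues one such token-range query per Spark partition, slicing the Murmur3 ring (which spans [-2^63, 2^63 - 1]) into contiguous ranges. A minimal sketch of that slicing, assuming naive even splits (the real connector derives splits from cluster metadata):

```python
# Sketch: divide the Murmur3 token ring into n contiguous ranges, the way the
# Spark connector conceptually assigns one token slice per task.
# Illustrative only -- not the connector's actual split logic.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def token_ranges(n):
    span = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * span for i in range(n)] + [MAX_TOKEN]
    # Each range is (exclusive lower, inclusive upper), matching
    # "token(pk) > lower AND token(pk) <= upper".
    return list(zip(bounds[:-1], bounds[1:]))

for lo, hi in token_ranges(4):
    print(f'SELECT "d", "b" FROM "test" WHERE token("pk") > {lo} AND token("pk") <= {hi}')
```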
Upvotes: 1
Views: 414
Reputation: 57748
SELECT "d", "b" FROM "test"
WHERE token("pk") > 6755909504673608635 AND d = 20140407 ALLOW FILTERING
Wow, Spark generated that query? The reason this doesn't perform well is that a query on a secondary index works efficiently only in conjunction with the PRIMARY KEY, and only when the PRIMARY KEY value is specified with an equality operator. Basically, the > operator doesn't give Cassandra enough information to know which partition the value is on, so it has to search them all (which is why it takes so long).
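The difference can be sketched with a toy in-memory model (plain Python, not Cassandra internals): equality on the partition key resolves to a single partition, while a token range with filtering forces a scan over every partition.

```python
# Toy model: why equality on the partition key is fast but a token-range
# query with ALLOW FILTERING is slow. Illustrative sketch only.

def murmur_like_token(pk):
    # Stand-in for Murmur3; any deterministic hash serves the illustration
    # (Python's hash of a small int is the int itself).
    return hash(pk)

# "Cluster" data: partition key -> row
partitions = {pk: {"pk": pk, "d": 20140407 if pk % 2 == 0 else 1, "b": "x"}
              for pk in range(10)}

def query_by_pk(pk):
    # Equality on the partition key: one direct partition lookup.
    return [partitions[pk]] if pk in partitions else []

def query_by_token_range_with_filter(min_token, d):
    # token(pk) > min_token AND d = ... ALLOW FILTERING:
    # every partition in the range must be read, then filtered on d.
    return [row for pk, row in partitions.items()
            if murmur_like_token(pk) > min_token and row["d"] == d]

print(len(query_by_pk(4)))                                    # 1 partition read
print(len(query_by_token_range_with_filter(-2**63, 20140407)))  # all partitions scanned
```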
I feel the better option here, instead of a secondary index on test(d), is to create an additional query table (and modify your application to persist data to that table as well):
CREATE TABLE testbyd (
    d int,
    pk bigint,
    b text,
    PRIMARY KEY (d,pk)
);
While I can't say how Spark will treat that table, I can tell you that this CQL query will work on that table:
SELECT d, b
FROM testbyd
WHERE d=20140407 AND pk > 6755909504673608635;
Likewise, this CQL query:
select * from testbyd where d = 20140407;
...will perform much better than it will against the "test" table. In any case, give that a shot and see how Spark treats that query table.
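The dual-write pattern described above can be sketched in plain Python (toy in-memory dicts standing in for the two Cassandra tables; all names are illustrative):

```python
# Toy stand-ins for the two tables; every write goes to both, so reads
# by d hit a single partition of the query table with no filtering.
test_by_pk = {}   # models: test    (PRIMARY KEY ((pk)))
testbyd = {}      # models: testbyd (PRIMARY KEY (d, pk))

def insert_row(pk, b, d):
    # Application-side dual write: persist to the base table AND the query table.
    row = {"pk": pk, "b": b, "d": d}
    test_by_pk[pk] = row
    testbyd.setdefault(d, {})[pk] = row

def select_by_d(d):
    # Equivalent of: SELECT * FROM testbyd WHERE d = ?;
    # One partition lookup; rows come back ordered by clustering column pk.
    return sorted(testbyd.get(d, {}).values(), key=lambda r: r["pk"])

insert_row(1, "a", 20140407)
insert_row(2, "b", 20140408)
insert_row(3, "c", 20140407)

print([r["pk"] for r in select_by_d(20140407)])
```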
Upvotes: 2