Pavel Ajtkulov
Pavel Ajtkulov

Reputation: 521

Configuration for Cassandra on single node

We have single Cassandra node.

With table:

CREATE TABLE test (
  pk bigint,
  b text,
  d int,
  PRIMARY KEY ((pk))
);

CREATE INDEX test_d_idx ON test (d);

The query in cqlsh

select * from test where d = 20140407

work fast, as well.

But DataStax driver for Spark generates query:

SELECT "d", "b" FROM "test" WHERE token("pk") > 6755909504673608635 AND d = 20140407 ALLOW FILTERING

which is slow.

SELECT partitioner FROM system.local; 

returns

> org.apache.cassandra.dht.Murmur3Partitioner

Question: how to configure one node in such way, that SELECT with token("pk") would work fast?

Table contains 10^7 rows (if table contains 1000 rows, it works fast, obviously).

Upvotes: 1

Views: 414

Answers (1)

Aaron
Aaron

Reputation: 57748

SELECT "d", "b" FROM "test" 
WHERE token("pk") > 6755909504673608635 AND d = 20140407 ALLOW FILTERING

Wow, Spark generated that query? The reason this doesn't work (well) is that queries on secondary indexes in the WHERE clause only work in conjunction with the PRIMARY KEY when the PRIMARY KEY value is specified with an equals operator. Basically the > operator is not giving Cassandra enough information to know on which partition the value is on, so it has to search them all (which is why it takes so long).

I feel the better option here, is instead of a secondary index on test(d), create an additional query table (and modify your application to persist data out to that table as well):

CREATE TABLE testbyd (
  d int,
  pk bigint,
  b text,
  PRIMARY KEY (d,pk)
);

While I can't say how Spark will treat that table, I can tell you that this CQL query will work on that table:

SELECT d, b
FROM testbyd
WHERE d=20140407 AND pk > 6755909504673608635;

Likewise, this CQL query:

select * from testbyd where d = 20140407;

...will perform much better than it will against the "test" table. In any case, give that a shot and see how Spark treats that query table.

Upvotes: 2

Related Questions