Cassandra read timeout because of some response size limit reached?

Question

I encountered a strange behavior on cassandra 3.0:

I have the following table:

CREATE TABLE table (
  id text,
  ts text,
  score decimal,
  type text,
  values text,
  PRIMARY KEY (id, ts)
) WITH CLUSTERING ORDER BY (ts DESC)

and the following query (which returns instantly):

SELECT * FROM keyspace.table WHERE id='someId' AND ts IN ('2017-10-15','2017-10-16','2017-10-17','2017-10-18','2017-10-19','2017-10-20','2017-10-21','2017-10-22','2017-10-23','2017-10-24','2017-10-25','2017-10-26','2017-10-27','2017-10-28','2017-10-29','2017-10-30','2017-10-31','2017-11-01','2017-11-02','2017-11-03','2017-11-04','2017-11-05','2017-11-06');

If I add another day in the IN clause, the response never comes (even after 10 minutes!!!):

SELECT * FROM keyspace.table WHERE id='someId' AND ts IN ('2017-10-15','2017-10-16','2017-10-17','2017-10-18','2017-10-19','2017-10-20','2017-10-21','2017-10-22','2017-10-23','2017-10-24','2017-10-25','2017-10-26','2017-10-27','2017-10-28','2017-10-29','2017-10-30','2017-10-31','2017-11-01','2017-11-02','2017-11-03','2017-11-04','2017-11-05','2017-11-06', '2017-11-07');

The 'values' column may have large json data. There is some flag in cassandra.yaml with some size threshold or something like this? I guess adding another day in the query reaches some limit somewhere...in cassandra system.log I didn't see anything relevant to this.

Highstead · Accepted Answer

If it succeeds on one node and not another while the query will work with 1 less 'in' clause I would guess this is a memory pressure issue. To eliminate the 'query parsing problem' you can re-write your query as:

SELECT * FROM myTable WHERE id = 'x' AND ts >= '2017-10-15' AND ts <= '2017-11-07';

The in clause is only truly useful if you start bucketing your data. This is a good approach if you have hotspots or if you see 1 node with much higher load than the others.

To bucket your data you would want to do something like: CREATE TABLE table ( id text, ts text, score decimal, type text, values text, PRIMARY KEY ((id, ts), type) ) WITH CLUSTERING ORDER BY (type DESC) Your data would now be partitioned by id AND day. Your query would then become what you have now:

SELECT * FROM myTable WHERE id='x' AND ts in ('2017-01-01')

This will better distribute data on the HDDs and allow better parallelization from cassandra. This WILL NOT fix the memory pressure issue. To fix that you would want to move the aggregation of data from the coordinator to your application layer.

This means running N SELECT ... WHERE id='x' and ts = '2017-01-01'; queries.

Cassandra read timeout because of some response size limit reached?

Answers (1)

Related Questions