Reputation: 123
CREATE TABLE atim_cloud.messages (
deviceid text,
channelname text,
time timestamp,
avgsignal float,
latitude float,
longitude float,
rssi float,
snr float,
stationid text,
value blob,
valuetype text,
PRIMARY KEY ((deviceid, channelname), time)
) WITH CLUSTERING ORDER BY (time DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX messages_deviceid_idx ON atim_cloud.messages (deviceid);
CREATE INDEX messages_channelname_idx ON atim_cloud.messages (channelname);
CREATE INDEX messages_time_idx ON atim_cloud.messages (time);
This table was designed to hold a large quantity of data (millions of rows). A simple request works fine, like:
SELECT * FROM messages WHERE deviceid ='1DB8D';
And I get :
deviceid | channelname | time | avgsignal | latitude | longitude | rssi | snr | stationid | value | valuetype
----------+-------------+--------------------------+-----------+----------+-----------+--------+-------+-----------+------------+-----------
1DB8D | INDEX1 | 2015-07-26 22:21:59+0200 | 9.9 | 45 | 6 | -125.5 | 9.66 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 22:11:58+0200 | 9.89 | 45 | 6 | -125.5 | 9.85 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 22:01:59+0200 | 9.87 | 45 | 6 | -123.5 | 10.08 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:51:59+0200 | 9.83 | 45 | 6 | -125.5 | 9.8 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:41:59+0200 | 9.83 | 45 | 6 | -124.5 | 10.02 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:31:58+0200 | 9.8 | 45 | 6 | -126.5 | 10.35 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:21:59+0200 | 9.78 | 45 | 6 | -122.5 | 9.91 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:11:59+0200 | 9.82 | 45 | 6 | -130.5 | 8.85 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:01:59+0200 | 9.79 | 45 | 6 | -129.5 | 10.11 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 20:51:58+0200 | 9.77 | 45 | 6 | -124.5 | 10.06 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 20:41:59+0200 | 9.78 | 45 | 6 | -123.5 | 9.52 | 0E00 | 0x00000000 | int
But when I run a more complex request involving a timestamp comparison, like the following (it happens sometimes, but not every time):
SELECT * FROM messages WHERE deviceid = '1DB8D' AND time >= 1437981692831 LIMIT 500 ALLOW FILTERING ;
or simply:
SELECT COUNT(*) FROM messages ;
I get this after a few seconds (a timeout, I guess):
errors={}, last_host=127.0.0.1
Do you have any suggestions to solve my problem? I'm looking for advice about indexes or the primary key, but I haven't found anything.
If you have tips to make this table perform well, I'm glad to hear them. What about a multi-node cluster? I don't understand everything about it.
Thank you ;)
Upvotes: 2
Views: 450
Reputation: 123
I got some time to try the solutions you gave me, so I created two other tables to test different requests (I need at least these three requests, shown below):
SELECT * FROM messages WHERE deviceid = '1DB8D' AND time >= 1438853606718 ALLOW FILTERING ;
SELECT * FROM messages WHERE deviceid = '1DB8D' AND channelname = 'brutMessage';
SELECT * FROM messages WHERE deviceid = '1DB8D' AND channelname = 'brutMessage' AND time >= 1438853606718;
So here are my results:
PRIMARY KEY ((deviceid, channelname), time) ) WITH CLUSTERING ORDER BY (time DESC)
SELECT * FROM messages WHERE deviceid = '1DB8D' AND time >= 1438853606718 ALLOW FILTERING ; <==== errors={}, last_host=127.0.0.1
SELECT * FROM messages WHERE deviceid = '1DB8D' AND channelname = 'brutMessage'; <=== OK
SELECT * FROM messages WHERE deviceid = '1DB8D' AND channelname = 'brutMessage' AND time >= 1438853606718; <=== OK
PRIMARY KEY (deviceid, time, channelname)) WITH CLUSTERING ORDER BY (time DESC, channelname ASC)
SELECT * FROM messages_test WHERE deviceid = '1DB8D' AND time >= 1438853606718 ; <==== OK
SELECT * FROM messages_test WHERE deviceid = '1DB8D' AND channelname='brutMessage' AND time >= 1438853606718 ; <==== code=2200 [Invalid query] message="PRIMARY KEY column "channelname" cannot be restricted
(preceding column "ColumnDefinition{name=time, type=org.apache.cassandra.db.marshal.ReversedType
(org.apache.cassandra.db.marshal.TimestampType), kind=CLUSTERING_COLUMN, componentIndex=0,
indexName=null, indexType=null}" is either not restricted or by a non-EQ relation)"
SELECT * FROM messages_test WHERE deviceid = '1DB8D' AND channelname='brutMessage'; <==== code=2200 [Invalid query] message="PRIMARY KEY column "channelname" cannot be restricted
(preceding column "ColumnDefinition{name=time, type=org.apache.cassandra.db.marshal.ReversedType
(org.apache.cassandra.db.marshal.TimestampType), kind=CLUSTERING_COLUMN, componentIndex=0,
indexName=null, indexType=null}" is either not restricted or by a non-EQ relation)"
PRIMARY KEY (deviceid, channelname, time)) WITH CLUSTERING ORDER BY (channelname ASC, time DESC)
SELECT * FROM messages_test2 WHERE deviceid = '1DB8D' AND time >= 1438853606718 ; <==== code=2200 [Invalid query] message="PRIMARY KEY column "time" cannot be restricted
(preceding column "ColumnDefinition{name=channelname, type=org.apache.cassandra.db.marshal.UTF8Type,
kind=CLUSTERING_COLUMN, componentIndex=0, indexName=null, indexType=null}"
is either not restricted or by a non-EQ relation)"
SELECT * FROM messages_test2 WHERE deviceid = '1DB8D' AND channelname = 'brutMessage' AND time >= 1438853606718 ; <==== OK
SELECT * FROM messages_test2 WHERE deviceid = '1DB8D' AND channelname = 'brutMessage'; <====== OK
PS: Thanks to @BryceAtNetwork23, who explained to me how to duplicate a table with the COPY TO/FROM method.
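For reference, a rough sketch of those cqlsh COPY commands (the target table must already exist with its new PRIMARY KEY, and the CSV file name here is just a placeholder):
-- export the existing rows to a CSV file (run from cqlsh)
COPY atim_cloud.messages (deviceid, channelname, time, avgsignal, latitude, longitude, rssi, snr, stationid, value, valuetype) TO 'messages.csv';
-- re-import the same columns into the new table
COPY atim_cloud.messages_test (deviceid, channelname, time, avgsignal, latitude, longitude, rssi, snr, stationid, value, valuetype) FROM 'messages.csv';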
Upvotes: 0
Reputation: 57738
Just to echo what Jim said: secondary indexes were not really intended to be a "magic bullet" used to fix the shortcomings of your data model. What you will want to do is build your data model around your queries.
Just curious, but what is channelname used for? You have it as a partition key, but it looks like you really don't want to query by it. Are you using it as a "bucket" to help keep your columns-per-partition under 2 billion? Or is it just for extra uniqueness?
Assuming you don't need it as a "bucket," if I restructure your PRIMARY KEY like this:
PRIMARY KEY (deviceid, time, channelname)
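A minimal sketch of such a table definition, assuming the same columns as your original (only the key and clustering order change; the keyspace name comes from my cqlsh prompt below):
CREATE TABLE stackoverflow2.messages (
    deviceid text,
    time timestamp,
    channelname text,
    avgsignal float,
    latitude float,
    longitude float,
    rssi float,
    snr float,
    stationid text,
    value blob,
    valuetype text,
    PRIMARY KEY (deviceid, time, channelname)
) WITH CLUSTERING ORDER BY (time DESC, channelname ASC);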
Then, this query works:
aploetz@cqlsh:stackoverflow2> SELECT * FROM messages WHERE deviceid ='1DB8D';
deviceid | time | channelname | avgsignal | latitude | longitude | rssi | snr | stationid | value | valuetype
----------+--------------------------+-------------+-----------+----------+-----------+--------+-------+-----------+------------+-----------
1DB8D | 2015-07-26 15:21:59-0500 | INDEX1 | 9.9 | 45 | 6 | -125.5 | 9.66 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 15:11:58-0500 | INDEX1 | 9.89 | 45 | 5 | -125.5 | 9.85 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 15:01:59-0500 | INDEX1 | 9.87 | 45 | 5 | -123.5 | 10.08 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:51:59-0500 | INDEX1 | 9.83 | 45 | 5 | -125.5 | 9.8 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:41:59-0500 | INDEX1 | 9.83 | 45 | 5 | -124.5 | 10.02 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:31:58-0500 | INDEX1 | 9.8 | 45 | 5 | -126.5 | 10.35 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:21:59-0500 | INDEX1 | 9.78 | 45 | 5 | -122.5 | 9.91 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:11:59-0500 | INDEX1 | 9.82 | 45 | 5 | -130.5 | 8.85 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:01:59-0500 | INDEX1 | 9.79 | 45 | 5 | -129.5 | 10.11 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 13:51:58-0500 | INDEX1 | 9.77 | 45 | 5 | -124.5 | 10.06 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 13:41:59-0500 | INDEX1 | 9.78 | 45 | 5 | -123.5 | 9.52 | 0E00 | 0x00000000 | int
(11 rows)
As does this one:
aploetz@cqlsh:stackoverflow2> SELECT * FROM messages WHERE deviceid ='1DB8D' AND time >='2015-07-26 14:00:00-0500';
deviceid | time | channelname | avgsignal | latitude | longitude | rssi | snr | stationid | value | valuetype
----------+--------------------------+-------------+-----------+----------+-----------+--------+-------+-----------+------------+-----------
1DB8D | 2015-07-26 15:21:59-0500 | INDEX1 | 9.9 | 45 | 6 | -125.5 | 9.66 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 15:11:58-0500 | INDEX1 | 9.89 | 45 | 5 | -125.5 | 9.85 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 15:01:59-0500 | INDEX1 | 9.87 | 45 | 5 | -123.5 | 10.08 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:51:59-0500 | INDEX1 | 9.83 | 45 | 5 | -125.5 | 9.8 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:41:59-0500 | INDEX1 | 9.83 | 45 | 5 | -124.5 | 10.02 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:31:58-0500 | INDEX1 | 9.8 | 45 | 5 | -126.5 | 10.35 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:21:59-0500 | INDEX1 | 9.78 | 45 | 5 | -122.5 | 9.91 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:11:59-0500 | INDEX1 | 9.82 | 45 | 5 | -130.5 | 8.85 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:01:59-0500 | INDEX1 | 9.79 | 45 | 5 | -129.5 | 10.11 | 0E00 | 0x00000000 | int
(9 rows)
What is more important is that they work without secondary indexes or the ALLOW FILTERING directive. This PRIMARY KEY also supports complete partition key (deviceid) queries, so these queries should perform well.
SELECT COUNT(*) FROM messages ;
Yeah, the Cassandra/CQL implementation of COUNT doesn't perform nearly as well as its RDBMS/SQL cousin. Richard Low of Apple (and a DataStax MVP) wrote an article ("Counting Keys In Cassandra") about how COUNT works "under the hood," and it's definitely worth a read. Essentially, COUNT has to iterate through EVERY ROW to give you that number. So if you run it against a large column family (table), it will time out.
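If you don't strictly need a total across the whole table, a count restricted to a single partition avoids the full scan entirely. A hedged example against your original schema, where the partition key is (deviceid, channelname):
-- counts only the rows in one partition, so it does not scan the whole table
SELECT COUNT(*) FROM messages WHERE deviceid = '1DB8D' AND channelname = 'INDEX1';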
Note: If you find that you do need a bucket (to keep your columns-per-partition below 2 billion), you could pick a time/date increment that makes sense for your data, and query within that window. For instance, you could make the date your bucket (as a text field), build your partition key as PRIMARY KEY ((deviceid, datebucket), time, channelname), and then query like this:
SELECT * FROM messages WHERE deviceid ='1DB8D' AND datebucket='20150726';
Maybe see if something like that makes sense for you.
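A rough sketch of a table built that way (the table name messages_by_bucket is just an example, and datebucket would be populated by your application at write time):
CREATE TABLE messages_by_bucket (
    deviceid text,
    datebucket text,
    time timestamp,
    channelname text,
    avgsignal float,
    latitude float,
    longitude float,
    rssi float,
    snr float,
    stationid text,
    value blob,
    valuetype text,
    PRIMARY KEY ((deviceid, datebucket), time, channelname)
) WITH CLUSTERING ORDER BY (time DESC, channelname ASC);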
EDIT: Patrick McFadin of DataStax wrote a great article on Getting Started With Time Series Data Modeling. In this article he describes using a "time bucket" to further partition a table in a time series application. This saves you from hitting the 2 billion column-per-partition limit, and also helps Cassandra avoid (older) data that you don't really care about at that moment.
Upvotes: 1
Reputation: 9475
Secondary indexes are only recommended for fields with low cardinality. They are very inefficient for fields with high cardinality such as your time field. That is why you will get timeout errors when you use your time field in a query.
In Cassandra you should concentrate on using good primary keys rather than fixing schema problems by creating secondary indexes.
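For example, the high-cardinality index on time from the question could simply be dropped (the index name comes from the CREATE INDEX statements above):
-- remove the secondary index on the high-cardinality time column
DROP INDEX atim_cloud.messages_time_idx;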
Upvotes: 3