Reputation: 123
CREATE TABLE atim_cloud.messages (
deviceid text,
channelname text,
time timestamp,
avgsignal float,
latitude float,
longitude float,
rssi float,
snr float,
stationid text,
value blob,
valuetype text,
PRIMARY KEY ((deviceid, channelname), time)
) WITH CLUSTERING ORDER BY (time DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX messages_deviceid_idx ON atim_cloud.messages (deviceid);
CREATE INDEX messages_channelname_idx ON atim_cloud.messages (channelname);
CREATE INDEX messages_time_idx ON atim_cloud.messages (time);
This table was designed to hold a large quantity of data (millions of rows). A simple request works fine, like:
SELECT * FROM messages WHERE deviceid ='1DB8D';
And I get :
deviceid | channelname | time | avgsignal | latitude | longitude | rssi | snr | stationid | value | valuetype
----------+-------------+--------------------------+-----------+----------+-----------+--------+-------+-----------+------------+-----------
1DB8D | INDEX1 | 2015-07-26 22:21:59+0200 | 9.9 | 45 | 6 | -125.5 | 9.66 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 22:11:58+0200 | 9.89 | 45 | 6 | -125.5 | 9.85 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 22:01:59+0200 | 9.87 | 45 | 6 | -123.5 | 10.08 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:51:59+0200 | 9.83 | 45 | 6 | -125.5 | 9.8 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:41:59+0200 | 9.83 | 45 | 6 | -124.5 | 10.02 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:31:58+0200 | 9.8 | 45 | 6 | -126.5 | 10.35 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:21:59+0200 | 9.78 | 45 | 6 | -122.5 | 9.91 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:11:59+0200 | 9.82 | 45 | 6 | -130.5 | 8.85 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 21:01:59+0200 | 9.79 | 45 | 6 | -129.5 | 10.11 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 20:51:58+0200 | 9.77 | 45 | 6 | -124.5 | 10.06 | 0E00 | 0x00000000 | int
1DB8D | INDEX1 | 2015-07-26 20:41:59+0200 | 9.78 | 45 | 6 | -123.5 | 9.52 | 0E00 | 0x00000000 | int
But when I run a more complex request involving a timestamp comparison, like the following (it happens sometimes, but not every time):
SELECT * FROM messages WHERE deviceid = '1DB8D' AND time >= 1437981692831 LIMIT 500 ALLOW FILTERING ;
or simply:
SELECT COUNT(*) FROM messages ;
I get this after a few seconds (a timeout, I guess):
errors={}, last_host=127.0.0.1
Do you have any suggestions to solve my problem? I'm looking for advice about indexes or the primary key, but I haven't found anything.
If you have tips to make this table perform well, I'm glad to hear them. What about a multi-node cluster? I don't understand everything about it.
Thank you ;)
Upvotes: 2
Views: 450
Reputation: 123
I got some time to try the solutions you gave me, so I created two other tables to test different requests (I need at least these three requests, shown below):
SELECT * FROM messages WHERE deviceid = '1DB8D' AND time >= 1438853606718 ALLOW FILTERING ;
SELECT * FROM messages WHERE deviceid = '1DB8D' AND channelname = 'brutMessage';
SELECT * FROM messages WHERE deviceid = '1DB8D' AND channelname = 'brutMessage' AND time >= 1438853606718;
So here are my results:
PRIMARY KEY ((deviceid, channelname), time) ) WITH CLUSTERING ORDER BY (time DESC)
SELECT * FROM messages WHERE deviceid = '1DB8D' AND time >= 1438853606718 ALLOW FILTERING ; <==== errors={}, last_host=127.0.0.1
SELECT * FROM messages WHERE deviceid = '1DB8D' AND channelname = 'brutMessage'; <=== OK
SELECT * FROM messages WHERE deviceid = '1DB8D' AND channelname = 'brutMessage' AND time >= 1438853606718; <=== OK
PRIMARY KEY (deviceid, time, channelname)) WITH CLUSTERING ORDER BY (time DESC, channelname ASC)
SELECT * FROM messages_test WHERE deviceid = '1DB8D' AND time >= 1438853606718 ; <==== OK
SELECT * FROM messages_test WHERE deviceid = '1DB8D' AND channelname='brutMessage' AND time >= 1438853606718 ; <==== code=2200 [Invalid query] message="PRIMARY KEY column "channelname" cannot be restricted
(preceding column "ColumnDefinition{name=time, type=org.apache.cassandra.db.marshal.ReversedType
(org.apache.cassandra.db.marshal.TimestampType), kind=CLUSTERING_COLUMN, componentIndex=0,
indexName=null, indexType=null}" is either not restricted or by a non-EQ relation)"
SELECT * FROM messages_test WHERE deviceid = '1DB8D' AND channelname='brutMessage'; <==== code=2200 [Invalid query] message="PRIMARY KEY column "channelname" cannot be restricted
(preceding column "ColumnDefinition{name=time, type=org.apache.cassandra.db.marshal.ReversedType
(org.apache.cassandra.db.marshal.TimestampType), kind=CLUSTERING_COLUMN, componentIndex=0,
indexName=null, indexType=null}" is either not restricted or by a non-EQ relation)"
PRIMARY KEY (deviceid, channelname, time)) WITH CLUSTERING ORDER BY (channelname ASC, time DESC)
SELECT * FROM messages_test2 WHERE deviceid = '1DB8D' AND time >= 1438853606718 ; <==== code=2200 [Invalid query] message="PRIMARY KEY column "time" cannot be restricted
(preceding column "ColumnDefinition{name=channelname, type=org.apache.cassandra.db.marshal.UTF8Type,
kind=CLUSTERING_COLUMN, componentIndex=0, indexName=null, indexType=null}"
is either not restricted or by a non-EQ relation)"
SELECT * FROM messages_test2 WHERE deviceid = '1DB8D' AND channelname = 'brutMessage' AND time >= 1438853606718 ; <==== OK
SELECT * FROM messages_test2 WHERE deviceid = '1DB8D' AND channelname = 'brutMessage'; <====== OK
PS: Thanks to @BryceAtNetwork23, who explained to me how to duplicate a table with the COPY TO/FROM method.
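For reference, a rough sketch of those cqlsh COPY commands (the target table must already exist with its new PRIMARY KEY, and the CSV file name here is just a placeholder):
-- export the existing rows to a CSV file (run from cqlsh)
COPY atim_cloud.messages (deviceid, channelname, time, avgsignal, latitude, longitude, rssi, snr, stationid, value, valuetype) TO 'messages.csv';
-- re-import the same columns into the new table
COPY atim_cloud.messages_test (deviceid, channelname, time, avgsignal, latitude, longitude, rssi, snr, stationid, value, valuetype) FROM 'messages.csv';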
Upvotes: 0
Reputation: 57738
Just to echo what Jim said: secondary indexes were not really intended to be a "magic bullet" used to fix the shortcomings of your data model. What you will want to do is build your data model around your queries.
Just curious, but what is channelname used for? You have it as a partition key, but it looks like you really don't want to query by it. Are you using it as a "bucket" to help keep your columns-per-partition under 2 billion? Or is it just for extra uniqueness?
Assuming you don't need it as a "bucket," if I restructure your PRIMARY KEY like this:
PRIMARY KEY (deviceid, time, channelname)
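A minimal sketch of such a table definition, assuming the same columns as your original (only the key and clustering order change; the keyspace name comes from my cqlsh prompt below):
CREATE TABLE stackoverflow2.messages (
    deviceid text,
    time timestamp,
    channelname text,
    avgsignal float,
    latitude float,
    longitude float,
    rssi float,
    snr float,
    stationid text,
    value blob,
    valuetype text,
    PRIMARY KEY (deviceid, time, channelname)
) WITH CLUSTERING ORDER BY (time DESC, channelname ASC);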
Then, this query works:
aploetz@cqlsh:stackoverflow2> SELECT * FROM messages WHERE deviceid ='1DB8D';
deviceid | time | channelname | avgsignal | latitude | longitude | rssi | snr | stationid | value | valuetype
----------+--------------------------+-------------+-----------+----------+-----------+--------+-------+-----------+------------+-----------
1DB8D | 2015-07-26 15:21:59-0500 | INDEX1 | 9.9 | 45 | 6 | -125.5 | 9.66 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 15:11:58-0500 | INDEX1 | 9.89 | 45 | 5 | -125.5 | 9.85 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 15:01:59-0500 | INDEX1 | 9.87 | 45 | 5 | -123.5 | 10.08 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:51:59-0500 | INDEX1 | 9.83 | 45 | 5 | -125.5 | 9.8 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:41:59-0500 | INDEX1 | 9.83 | 45 | 5 | -124.5 | 10.02 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:31:58-0500 | INDEX1 | 9.8 | 45 | 5 | -126.5 | 10.35 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:21:59-0500 | INDEX1 | 9.78 | 45 | 5 | -122.5 | 9.91 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:11:59-0500 | INDEX1 | 9.82 | 45 | 5 | -130.5 | 8.85 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:01:59-0500 | INDEX1 | 9.79 | 45 | 5 | -129.5 | 10.11 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 13:51:58-0500 | INDEX1 | 9.77 | 45 | 5 | -124.5 | 10.06 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 13:41:59-0500 | INDEX1 | 9.78 | 45 | 5 | -123.5 | 9.52 | 0E00 | 0x00000000 | int
(11 rows)
As does this one:
aploetz@cqlsh:stackoverflow2> SELECT * FROM messages WHERE deviceid ='1DB8D' AND time >='2015-07-26 14:00:00-0500';
deviceid | time | channelname | avgsignal | latitude | longitude | rssi | snr | stationid | value | valuetype
----------+--------------------------+-------------+-----------+----------+-----------+--------+-------+-----------+------------+-----------
1DB8D | 2015-07-26 15:21:59-0500 | INDEX1 | 9.9 | 45 | 6 | -125.5 | 9.66 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 15:11:58-0500 | INDEX1 | 9.89 | 45 | 5 | -125.5 | 9.85 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 15:01:59-0500 | INDEX1 | 9.87 | 45 | 5 | -123.5 | 10.08 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:51:59-0500 | INDEX1 | 9.83 | 45 | 5 | -125.5 | 9.8 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:41:59-0500 | INDEX1 | 9.83 | 45 | 5 | -124.5 | 10.02 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:31:58-0500 | INDEX1 | 9.8 | 45 | 5 | -126.5 | 10.35 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:21:59-0500 | INDEX1 | 9.78 | 45 | 5 | -122.5 | 9.91 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:11:59-0500 | INDEX1 | 9.82 | 45 | 5 | -130.5 | 8.85 | 0E00 | 0x00000000 | int
1DB8D | 2015-07-26 14:01:59-0500 | INDEX1 | 9.79 | 45 | 5 | -129.5 | 10.11 | 0E00 | 0x00000000 | int
(9 rows)
What is more important is that they work without secondary indexes or the ALLOW FILTERING directive. This PRIMARY KEY also supports complete partition key (deviceid) queries, so these queries should perform well.
SELECT COUNT(*) FROM messages ;
Yeah, the Cassandra/CQL implementation of COUNT doesn't perform nearly as well as its RDBMS/SQL cousin. Richard Low of Apple (and a DataStax MVP) wrote an article ("Counting Keys In Cassandra") about how COUNT works "under the hood," and it's definitely worth a read. Essentially, COUNT has to iterate through EVERY ROW to give you that number. So if you run it against a large column family (table), it will time out.
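If you don't strictly need a total across the whole table, a count restricted to a single partition avoids the full scan entirely. A hedged example against your original schema, where the partition key is (deviceid, channelname):
-- counts only the rows in one partition, so it does not scan the whole table
SELECT COUNT(*) FROM messages WHERE deviceid = '1DB8D' AND channelname = 'INDEX1';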
Note: If you find that you do need a bucket (to keep your columns-per-partition below 2 billion), you could pick a time/date increment that makes sense for your data, and query within that window. For instance, you could make the date your bucket (as a text field), build your partition key as PRIMARY KEY ((deviceid, datebucket), time, channelname), and then query like this:
SELECT * FROM messages WHERE deviceid ='1DB8D' AND datebucket='20150726';
Maybe see if something like that makes sense for you.
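A rough sketch of a table built that way (the table name messages_by_bucket is just an example, and datebucket would be populated by your application at write time):
CREATE TABLE messages_by_bucket (
    deviceid text,
    datebucket text,
    time timestamp,
    channelname text,
    avgsignal float,
    latitude float,
    longitude float,
    rssi float,
    snr float,
    stationid text,
    value blob,
    valuetype text,
    PRIMARY KEY ((deviceid, datebucket), time, channelname)
) WITH CLUSTERING ORDER BY (time DESC, channelname ASC);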
EDIT: Patrick McFadin of DataStax wrote a great article on Getting Started With Time Series Data Modeling. In this article he describes using a "time bucket" to further partition a table in a time series application. This saves you from hitting the 2 billion column-per-partition limit, and also helps Cassandra avoid (older) data that you don't really care about at that moment.
Upvotes: 1
Reputation: 9475
Secondary indexes are only recommended for fields with low cardinality. They are very inefficient for fields with high cardinality such as your time field. That is why you will get timeout errors when you use your time field in a query.
In Cassandra you should concentrate on using good primary keys rather than fixing schema problems by creating secondary indexes.
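For example, the high-cardinality index on time from the question could simply be dropped (the index name comes from the CREATE INDEX statements above):
-- remove the secondary index on the high-cardinality time column
DROP INDEX atim_cloud.messages_time_idx;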
Upvotes: 3