Cassandra: selecting first entry for each value of an indexed column

Question

I have a table of events and would like to extract the first timestamp (column unixtime) for each user. Is there a way to do this with a single Cassandra query?

The schema is the following:

CREATE TABLE events (
 id VARCHAR,
 unixtime bigint,
 u bigint,
 type VARCHAR,
 payload map, 
 PRIMARY KEY(id)
);

CREATE INDEX events_u
  ON events (u);

CREATE INDEX events_unixtime
  ON events (unixtime);

CREATE INDEX events_type
  ON events (type);

ashic · Accepted Answer

According to your schema, each user will have a single time stamp. If you want one event per entry, consider:

PRIMARY KEY (id, unixtime).

Assuming that is your schema, the entries for a user will be stored in ascending unixtime order. Be careful though...if it's an unbounded event stream and users have lots of events, the partition for the id will grow and grow. It's recommended to keep partition sizes to tens or hundreds of megs. If you anticipate larger, you'll need to start some form of bucketing.

Now, on to your query. In a word, no. If you don't hit a partition (by specifying the partition key), your query becomes a cluster wide operation. With little data it'll work. But with lots of data, you'll get timeouts. If you do have the data in its current form, then I recommend you use the Cassandra Spark connector and Apache Spark to do your query. An added benefit of the spark connectory is that if you have cassandra nodes as spark worker nodes, due to locality, you can efficiently hit a secondary index without specifying the partition key (which would normally cause a cluster wide query with timeout issues, etc.). You could even use Spark to get the required data and store it into another cassandra table for fast querying.

Cassandra: selecting first entry for each value of an indexed column

Answers (1)

Related Questions