u6f6o

Reputation: 2190

Counter a better choice for uniqueness?

I currently have the following table layout for a basic user event table:

CREATE TABLE IF NOT EXISTS events.events_by_user(
    user text,
    added_week int,
    added_timestamp timestamp,
    event text,
    uuid uuid,
    PRIMARY KEY((user, added_week), added_timestamp, event, uuid))
WITH CLUSTERING ORDER BY(added_timestamp DESC);

Thus uniqueness is basically guaranteed by the uuid, the last column of the primary key. This matters because several identical events for the same user can occur within the same millisecond (the resolution of the timestamp).

Another approach might be (if I am not mistaken) to drop the uuid column and replace it with a counter column instead, like this:

CREATE TABLE IF NOT EXISTS events.events_by_user(
    user text,
    added_week int,
    added_timestamp timestamp,
    event text,
    frequency counter,
    PRIMARY KEY((user, added_week), added_timestamp, event))
WITH CLUSTERING ORDER BY(added_timestamp DESC);

My thinking is that the counter would save some space and keep my rows from growing so wide. I am not sure, though, whether maintaining this counter has other performance implications, or whether there are other reasons why this might not be a good idea.

Upvotes: 0

Views: 471

Answers (1)

xmas79

Reputation: 5180

Why would you use a counter to save space? The C* design idiom is to spend space to gain efficiency.

Back to your question: counters are very limiting in what you can do. For example, they must live in their own tables, where you can have as many primary key columns as you want, but every other column must be a counter. They support only increment and decrement operations, so counter updates are not idempotent, and you have to be able to live with inaccuracies in the counted value (over- and under-counting is a well-known problem, even if C* 2.1+ mitigated it a bit).

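Because of the own-table rule, every non-counter column has to be part of the primary key. In the table you posted, event is a clustering column, so the definition itself is accepted; you just could not add any regular column next to the counter later.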
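To make the point about non-idempotent counter updates concrete, here is a minimal sketch in plain Python. It does not talk to Cassandra at all: `CounterCell`, `EventRows` and `send_with_retry` are hypothetical stand-ins, and the retry models a client that never received the acknowledgement for a write that was in fact applied.

```python
import uuid


class CounterCell:
    """Toy stand-in for a counter column: it only understands deltas."""

    def __init__(self):
        self.value = 0

    def increment(self, delta=1):
        self.value += delta


class EventRows:
    """Toy stand-in for rows keyed by a unique id: every write is an upsert."""

    def __init__(self):
        self.rows = {}

    def upsert(self, event_id, event):
        self.rows[event_id] = event


def send_with_retry(operation):
    """Simulate a lost acknowledgement followed by one blind retry."""
    operation()  # first attempt is applied, but the client never hears back
    operation()  # the retry applies the very same operation again


# One logical event recorded through a counter increment.
counter = CounterCell()
send_with_retry(lambda: counter.increment(1))

# The same logical event recorded as a row keyed by a unique id.
events = EventRows()
event_id = uuid.uuid4()
send_with_retry(lambda: events.upsert(event_id, "login"))

print(counter.value)     # the retry was counted a second time
print(len(events.rows))  # the retry overwrote the same row
```

Retrying the increment changes the result, while retrying the keyed write does not, which is why a unique identifier in the primary key is the safer way to record individual events.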

Back to your uniqueness requirement: you could use the timeuuid column type. These are time-based Type 1 UUIDs and provide a reasonably low collision probability. From the Cassandra wiki:

A Type 1 UUID consists of the following:

  • A timestamp consisting of a count of 100-nanosecond intervals since 00:00:00.00, 15 October 1582 (the date of Gregorian reform to the Christian calendar).

  • A version (which should have a value of 1).

  • A variant (which should have a value of 2).

  • A sequence number, which can be a counter or a pseudo-random number.

  • A "node" which will be the machine's MAC address (which should make the UUID unique across machines).
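The fields listed above can be inspected with Python's standard `uuid` module. This is only an illustrative sketch, not how Cassandra generates timeuuid values: `uuid1_to_datetime` is a hypothetical helper, and `uuid.uuid1()` may fall back to a random node value when it cannot determine a MAC address.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Number of 100 ns intervals between 1582-10-15 and 1970-01-01 (the Unix epoch).
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000


def uuid1_to_datetime(u):
    """Convert the 60-bit timestamp of a version 1 UUID to a UTC datetime."""
    unix_100ns = u.time - GREGORIAN_TO_UNIX_100NS
    # datetime has microsecond resolution, so the last 100 ns digit is dropped.
    unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return unix_epoch + timedelta(microseconds=unix_100ns // 10)


u = uuid.uuid1()

print(u.version)             # version field of the UUID
print(u.variant)             # variant, reported by Python as a descriptive string
print(u.clock_seq)           # 14-bit sequence number
print(hex(u.node))           # 48-bit node, normally derived from a MAC address
print(uuid1_to_datetime(u))  # creation time recovered from the timestamp field
```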

The challenge with a UUID is to make it be unique for multiple processes running on a single machine and multiple threads running in a single process. The Type 1 UUID as specified above does neither. On a fast machine with multiple cores it is quite possible to have a UUID generated with the same time value. This can be remedied only if the sequence number can span threads and processes, something that is quite challenging to do efficiently.

The Time Based UUID referenced compensates for these issues by:

  • Only using the normal millisecond granularity returned by System.currentTimeMillis() and adjusting it to pretend to contain 100 ns counts.

  • Incrementing the time by 1 (in a non-threadsafe manner) whenever a duplicate time value is encountered.

  • Using a pseudo-random number associated with the UUID Class for the sequence number.

Incrementing the time by 1 allows multiple threads to uniquely create up to 10,000 UUIDs in the same millisecond in the same process. Using a pseudo-random number for the sequence number provides a 1 in a 16,384 chance that each UUID Class will have a unique id.
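The two figures in the quote follow from simple arithmetic: one millisecond contains 10,000 intervals of 100 ns, and a 14-bit sequence number has 16,384 possible values. The sketch below works through that and models the "increment by 1 on a duplicate" rule. `TickAllocator` is a hypothetical name, and like the scheme described in the quote it is deliberately not thread-safe.

```python
TICK_NS = 100                        # timestamp resolution of a Type 1 UUID
NS_PER_MS = 1_000_000
TICKS_PER_MS = NS_PER_MS // TICK_NS  # 100 ns ticks available per millisecond

SEQUENCE_BITS = 14                   # width of the clock sequence field
SEQUENCE_VALUES = 2 ** SEQUENCE_BITS


class TickAllocator:
    """Hand out unique 100 ns ticks from a clock with millisecond resolution."""

    def __init__(self):
        self.last = -1

    def next_tick(self, now_ms):
        tick = now_ms * TICKS_PER_MS  # pretend the millisecond value is in 100 ns units
        if tick <= self.last:
            tick = self.last + 1      # duplicate time value: bump it by one tick
        self.last = tick
        return tick


alloc = TickAllocator()
same_ms = 5
ticks = [alloc.next_tick(same_ms) for _ in range(TICKS_PER_MS)]
overflow = alloc.next_tick(same_ms)

print(TICKS_PER_MS)                            # ticks per millisecond
print(SEQUENCE_VALUES)                         # possible sequence numbers
print(len(set(ticks)))                         # unique ticks handed out in one millisecond
print(overflow == (same_ms + 1) * TICKS_PER_MS)  # next request borrows from the next millisecond
```

Once the ticks of one millisecond are used up, the next value spills into the following millisecond, which is the situation the first caveat in the quoted list describes.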

These mechanisms provide a reasonable probability that the generated UUIDs will be unique. However, the issues to be aware of are:

  • The computer is capable of generating more than 10,000 UUIDs per microsecond.

  • Applications creating UUIDs on different threads could get duplicates since the time is not incremented in a thread-safe manner.

  • More than one instance of the Class is in the VM in different Class Loaders - this will be mitigated by each Class having its own sequence number.

  • There is no guarantee that two instances of a UUID in the same or different VMs will have a different sequence number - just a reasonable probability that they will.

In practice, C* already does what you want. However, if you really fear that you will end up with duplicates, then you need to do proper counting yourself, and I'd suggest implementing that at the application level.

Upvotes: 1
