Reputation: 18869
I have to implement cursor-based pagination and am a bit confused about how to go about it, given that the primary key of my entities is not an auto-increment, as is the case in Aerospike, for example.
The most obvious alternative, when no comparison operator is available on the primary key in a distributed system without auto-increments, would be to use a timestamp. But how reliable is that?
That is, two users may make an upload at exactly the same time, which basically breaks the logic behind cursor-based pagination.
For example: give me the next 10 items starting from a certain timestamp that was sent as the cursor for fetching the next results. If two posts share that timestamp, one of them may be dropped and never shown because it didn't fit in the previously requested page (e.g. a page of 10 posts where the post with the duplicate timestamp would have been item 11).
How do you circumvent this problem?
The most obvious way would be to keep a secondary field next to the timestamp, a counter that is incremented whenever the timestamp already exists, and handle the extra logic at application level, but all of this seems to add a lot of bloat.
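(For illustration only, here is a minimal sketch of that composite-cursor idea in plain Java, independent of any particular database; the class and field names are made up. The cursor is the (timestamp, id) pair of the last item on the previous page, and items are ordered by that pair so that ties on the timestamp are broken by the unique ID.)
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class Item {
    final long timestamp;   // upload time in milliseconds
    final String id;        // unique item ID, used as the tie-breaker
    Item(long timestamp, String id) { this.timestamp = timestamp; this.id = id; }
}

class CompositeCursorPager {
    // Order by (timestamp, id); the ID disambiguates items with the same timestamp.
    static final Comparator<Item> ORDER =
        Comparator.comparingLong((Item it) -> it.timestamp).thenComparing(it -> it.id);

    // Return the next pageSize items strictly after the cursor (lastTimestamp, lastId).
    static List<Item> nextPage(List<Item> items, long lastTimestamp, String lastId, int pageSize) {
        Item cursor = new Item(lastTimestamp, lastId);
        return items.stream()
                .sorted(ORDER)
                .filter(it -> ORDER.compare(it, cursor) > 0)
                .limit(pageSize)
                .collect(Collectors.toList());
    }
}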
Any insight highly appreciated!
Upvotes: 3
Views: 2607
Reputation: 7117
I doubt that Twitter uses an RDBMS auto-increment row ID for this. There are services like ZooKeeper, external to the database, with which to implement a global sequence ID. Still, you may not want to have a global sequence ID, because if everyone has to ask for a sequence from the same source you're forcing everything to serialize, defeating the whole concept of distributed processing.
Time is a natural way to sequence operations, but you need to actually agree on what the time is. If different writers talk to a service that acts as a 'wall clock' they can more or less agree on time. Like you said, you don't need nanosecond precision here. A Map that has millisecond timestamps as its map-keys would allow you to do operations such as:
get_by_key_rel_index_range()
get_by_key_interval()
To model a user's tweets you could have their IDs stored in such a Map, with the record's key being the user ID.
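To make the Map idea concrete, here is a rough sketch using the Aerospike Java client (the namespace, set and bin names, and the KEY_ORDERED policy, are my own assumptions): the "tweets" bin is a key-ordered map from millisecond timestamps to tweet IDs, and getByKeyRelativeIndexRange returns the next page of entries starting at the cursor timestamp.
import java.util.List;

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.MapOperation;
import com.aerospike.client.cdt.MapOrder;
import com.aerospike.client.cdt.MapPolicy;
import com.aerospike.client.cdt.MapReturnType;
import com.aerospike.client.cdt.MapWriteFlags;

public class TweetMapPagination {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key userKey = new Key("test", "tweets", "user1");   // record key = user ID
        MapPolicy ordered = new MapPolicy(MapOrder.KEY_ORDERED, MapWriteFlags.DEFAULT);

        // Write: map-key = millisecond timestamp, map-value = tweet ID.
        client.operate(null, userKey,
            MapOperation.put(ordered, "tweets",
                Value.get(System.currentTimeMillis()), Value.get("tweet-123")));

        // Read: 10 entries starting at the first map-key >= cursorTs (offset 0 relative to it).
        long cursorTs = 0L;   // timestamp received as the cursor; 0 = start from the oldest entry
        Record rec = client.operate(null, userKey,
            MapOperation.getByKeyRelativeIndexRange("tweets", Value.get(cursorTs), 0, 10,
                MapReturnType.KEY_VALUE));

        // The selected entries come back as a List; handle them generically here.
        List<?> page = rec.getList("tweets");
        System.out.println(page);
        client.close();
    }
}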
To model a user's timeline you could have user-timeline records (keyed by user ID) with an ordered List containing [timestamp, tweet ID, .., ..]
as elements. This would allow elements with the same timestamp to exist side by side (whereas a map can't have two elements with exactly the same key).
The useful List operations in this case are:
get_by_value_rel_rank_range()
get_by_value_interval()
See Element Ordering and Comparison.
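And to make the timeline-as-List idea concrete, a rough sketch along the same lines (again, the namespace, set and bin names are my own placeholders): each element is a [timestamp, tweetId] list, so two tweets with the same timestamp remain distinct elements, and getByValueRelativeRankRange fetches the elements ranked just after the cursor element.
import java.util.Arrays;
import java.util.List;

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;
import com.aerospike.client.cdt.ListOrder;
import com.aerospike.client.cdt.ListPolicy;
import com.aerospike.client.cdt.ListReturnType;
import com.aerospike.client.cdt.ListWriteFlags;

public class TimelineListPagination {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key timelineKey = new Key("test", "timeline", "user1");   // record key = user ID
        ListPolicy ordered = new ListPolicy(ListOrder.ORDERED, ListWriteFlags.DEFAULT);

        // Each element is [timestamp, tweetId]; the tweet ID breaks ties between equal timestamps.
        client.operate(null, timelineKey,
            ListOperation.append(ordered, "timeline", Value.get(Arrays.asList(1000L, "tweet-A"))),
            ListOperation.append(ordered, "timeline", Value.get(Arrays.asList(1000L, "tweet-B"))));

        // Cursor = the last element of the previous page; fetch the 10 elements ranked just after it.
        Value cursor = Value.get(Arrays.asList(1000L, "tweet-A"));
        Record rec = client.operate(null, timelineKey,
            ListOperation.getByValueRelativeRankRange("timeline", cursor, 1, 10, ListReturnType.VALUE));

        List<?> page = rec.getList("timeline");
        System.out.println(page);   // e.g. [[1000, tweet-B]]
        client.close();
    }
}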
I have examples for how List and Map operations can be used to model different things here: rbotzer/aerospike-cdt-examples.
Upvotes: 1
Reputation: 5415
Use Capped Lists or Capped Maps as your data bin.
Here is a capped Map code snippet (or a variation thereof) that retains the last 10 updates:
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Key;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.MapOperation;
import com.aerospike.client.cdt.MapPolicy;
import com.aerospike.client.cdt.MapReturnType;

public class CappedMap {
    public static int insert(AerospikeClient client, int i) {
        Key key = new Key("test", "testMap", "user1");
        MapPolicy mPolicy = new MapPolicy();
        try {
            client.operate(null, key,
                // Trim the map to its 10 highest keys: INVERTED removes everything
                // outside the index range starting at -10 with count 10, i.e. all but the last 10 entries.
                // INVERTED was introduced in server version 3.16.0.1.
                MapOperation.removeByIndexRange("myMap", -10, 10, MapReturnType.INVERTED),
                // Then add the new entry, keyed by i.
                MapOperation.put(mPolicy, "myMap", Value.get(i),
                    Value.get("A quick brown fox jumps right over a lazy dog")));
        }
        catch (AerospikeException e) {
            System.out.println("Error Code: " + e.getResultCode());
        }
        return i;
    }

    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        for (int i = 0; i < 123; i++) {
            System.out.println("Inserting k = " + i);
            insert(client, i);
        }
        client.close();
    }
}
Upvotes: 2