Reputation: 18869
I have to implement cursor-based pagination and am a bit confused about how to go about it, given that the primary key of my entities is not an auto-increment, as is the case in Aerospike, for example.
The most obvious alternative, when no comparison operator is available on the primary key in a distributed system without auto-increments, would be to use a timestamp. But how reliable is that?
That is, two users may make an upload at exactly the same time, which basically breaks the logic behind cursor-based pagination.
For example: give me the next 10 items starting from a certain timestamp that was sent as the cursor for fetching the next results. If two posts share that timestamp, one of them may be dropped and never shown because it didn't fit in the previously requested page (e.g. a page of 10 posts where the post with the duplicate timestamp would have been item 11).
How do you circumvent this problem?
The most obvious way would be to keep a secondary field next to the timestamp, a counter that is incremented whenever the timestamp already exists, and handle the extra logic at application level, but all of this seems to add a lot of bloat.
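(For illustration only, here is a minimal sketch of that composite-cursor idea in plain Java, independent of any particular database; the class and field names are made up. The cursor is the (timestamp, id) pair of the last item on the previous page, and items are ordered by that pair so that ties on the timestamp are broken by the unique ID.)
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class Item {
    final long timestamp;   // upload time in milliseconds
    final String id;        // unique item ID, used as the tie-breaker
    Item(long timestamp, String id) { this.timestamp = timestamp; this.id = id; }
}

class CompositeCursorPager {
    // Order by (timestamp, id); the ID disambiguates items with the same timestamp.
    static final Comparator<Item> ORDER =
        Comparator.comparingLong((Item it) -> it.timestamp).thenComparing(it -> it.id);

    // Return the next pageSize items strictly after the cursor (lastTimestamp, lastId).
    static List<Item> nextPage(List<Item> items, long lastTimestamp, String lastId, int pageSize) {
        Item cursor = new Item(lastTimestamp, lastId);
        return items.stream()
                .sorted(ORDER)
                .filter(it -> ORDER.compare(it, cursor) > 0)
                .limit(pageSize)
                .collect(Collectors.toList());
    }
}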
Any insight highly appreciated!
Upvotes: 3
Views: 2607
Reputation: 7117
I doubt that Twitter uses an RDBMS auto-increment row ID for this. There are services like ZooKeeper, external to the database, with which to implement a global sequence ID. Still, you may not want to have a global sequence ID, because if everyone has to ask for a sequence from the same source you're forcing everything to serialize, defeating the whole concept of distributed processing.
Time is a natural way to sequence operations, but you need to actually agree on what the time is. If different writers talk to a service that acts as a 'wall clock' they can more or less agree on time. Like you said, you don't need nanosecond precision here. A Map that has millisecond timestamps as its map-keys would allow you to do operations such as:
get_by_key_rel_index_range()
get_by_key_interval()
To model a user's tweets you could have their IDs stored in such a Map, with the record's key being the user ID.
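To make the Map idea concrete, here is a rough sketch using the Aerospike Java client (the namespace, set and bin names, and the KEY_ORDERED policy, are my own assumptions): the "tweets" bin is a key-ordered map from millisecond timestamps to tweet IDs, and getByKeyRelativeIndexRange returns the next page of entries starting at the cursor timestamp.
import java.util.List;

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.MapOperation;
import com.aerospike.client.cdt.MapOrder;
import com.aerospike.client.cdt.MapPolicy;
import com.aerospike.client.cdt.MapReturnType;
import com.aerospike.client.cdt.MapWriteFlags;

public class TweetMapPagination {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key userKey = new Key("test", "tweets", "user1");   // record key = user ID
        MapPolicy ordered = new MapPolicy(MapOrder.KEY_ORDERED, MapWriteFlags.DEFAULT);

        // Write: map-key = millisecond timestamp, map-value = tweet ID.
        client.operate(null, userKey,
            MapOperation.put(ordered, "tweets",
                Value.get(System.currentTimeMillis()), Value.get("tweet-123")));

        // Read: 10 entries starting at the first map-key >= cursorTs (offset 0 relative to it).
        long cursorTs = 0L;   // timestamp received as the cursor; 0 = start from the oldest entry
        Record rec = client.operate(null, userKey,
            MapOperation.getByKeyRelativeIndexRange("tweets", Value.get(cursorTs), 0, 10,
                MapReturnType.KEY_VALUE));

        // The selected entries come back as a List; handle them generically here.
        List<?> page = rec.getList("tweets");
        System.out.println(page);
        client.close();
    }
}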
To model a user's timeline you could have user-timeline records (keyed by user ID) with an ordered List containing [timestamp, tweet ID, .., ..]
as elements. This would allow elements with the same timestamp to exist side by side (whereas a map can't have two elements with exactly the same key).
The useful List operations in this case are:
get_by_value_rel_rank_range()
get_by_value_interval()
See Element Ordering and Comparison.
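And to make the timeline-as-List idea concrete, a rough sketch along the same lines (again, the namespace, set and bin names are my own placeholders): each element is a [timestamp, tweetId] list, so two tweets with the same timestamp remain distinct elements, and getByValueRelativeRankRange fetches the elements ranked just after the cursor element.
import java.util.Arrays;
import java.util.List;

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;
import com.aerospike.client.cdt.ListOrder;
import com.aerospike.client.cdt.ListPolicy;
import com.aerospike.client.cdt.ListReturnType;
import com.aerospike.client.cdt.ListWriteFlags;

public class TimelineListPagination {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key timelineKey = new Key("test", "timeline", "user1");   // record key = user ID
        ListPolicy ordered = new ListPolicy(ListOrder.ORDERED, ListWriteFlags.DEFAULT);

        // Each element is [timestamp, tweetId]; the tweet ID breaks ties between equal timestamps.
        client.operate(null, timelineKey,
            ListOperation.append(ordered, "timeline", Value.get(Arrays.asList(1000L, "tweet-A"))),
            ListOperation.append(ordered, "timeline", Value.get(Arrays.asList(1000L, "tweet-B"))));

        // Cursor = the last element of the previous page; fetch the 10 elements ranked just after it.
        Value cursor = Value.get(Arrays.asList(1000L, "tweet-A"));
        Record rec = client.operate(null, timelineKey,
            ListOperation.getByValueRelativeRankRange("timeline", cursor, 1, 10, ListReturnType.VALUE));

        List<?> page = rec.getList("timeline");
        System.out.println(page);   // e.g. [[1000, tweet-B]]
        client.close();
    }
}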
I have examples for how List and Map operations can be used to model different things here: rbotzer/aerospike-cdt-examples.
Upvotes: 1
Reputation: 5415
Use Capped Lists or Capped Maps as your data bin.
Here is a capped Map code snippet (or a variation thereof) that retains the last 10 updates:
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Key;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.MapOperation;
import com.aerospike.client.cdt.MapPolicy;
import com.aerospike.client.cdt.MapReturnType;

public class CappedMap {
    public static int insert(AerospikeClient client, int i) {
        Key key = new Key("test", "testMap", "user1");
        MapPolicy mPolicy = new MapPolicy();
        try {
            client.operate(null, key,
                // Trim the map to its 10 highest keys: INVERTED removes everything
                // outside the index range starting at -10 with count 10, i.e. all but the last 10 entries.
                // INVERTED was introduced in server version 3.16.0.1.
                MapOperation.removeByIndexRange("myMap", -10, 10, MapReturnType.INVERTED),
                // Then add the new entry, keyed by i.
                MapOperation.put(mPolicy, "myMap", Value.get(i),
                    Value.get("A quick brown fox jumps right over a lazy dog")));
        }
        catch (AerospikeException e) {
            System.out.println("Error Code: " + e.getResultCode());
        }
        return i;
    }

    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        for (int i = 0; i < 123; i++) {
            System.out.println("Inserting k = " + i);
            insert(client, i);
        }
        client.close();
    }
}
Upvotes: 2