ThomasVdBerge

Range Key Querying on composed keys

Currently I have a collection which contains the following fields:

For my DynamoDB collection I used userId as the hashKey, and for the rangeKey I wanted to use date:otherUserId. That way I can retrieve all entries for a userId sorted by date, which is good.

However, for my use case I shouldn't have any duplicates, meaning the same userId-otherUserId pair shouldn't appear more than once in my collection. This means I should do a query first to check whether that 'couple' exists, remove it if needed and then do the insert, right?

EDIT:

Thanks for your help already :-)

The goal of my use case is to store when userA visits the profile of userB.

Now, the kinds of queries I would like to do are the following:


Answers (1)

oDDsKooL

I think you have a lot of choices, but here is one that might work, based on the assumption that your application is time-aware, i.e. you want to query for interactions in the last N minutes, hours, days, etc.

hash_key = userA
range_key = iso8601_timestamp+userB+uuid

First, the uuid suffix is there to avoid overwriting a record when two interactions between userA and userB happen at exactly the same time (which can occur depending on the granularity/precision of your clock). So insert-wise we are safe: no duplicate keys, no overwrites.
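As a rough illustration of this key layout, here is a minimal Python sketch that builds such a range key. The `#` separator, the fixed-width UTC timestamp format and the function name `time_first_range_key` are assumptions made for the example, not something DynamoDB prescribes; any separator works as long as it cannot occur in a user id.

```python
import uuid
from datetime import datetime, timezone

SEP = "#"  # assumed separator; must never appear inside a user id


def time_first_range_key(other_user_id, when=None):
    """Build '<ISO 8601 UTC timestamp>#<other user>#<uuid>' (hypothetical helper)."""
    if when is None:
        when = datetime.now(timezone.utc)
    # Fixed-width UTC format, so that string order matches chronological order.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    # The random uuid keeps two visits at the same instant from sharing a key.
    return SEP.join([stamp, other_user_id, uuid.uuid4().hex])


same_instant = datetime(2013, 1, 15, 10, 30, tzinfo=timezone.utc)
first = time_first_range_key("userB", same_instant)
second = time_first_range_key("userB", same_instant)
```

Both keys share the timestamp and user prefix but differ in the uuid part, so the second write would not overwrite the first.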

Query-wise, here is how things get done:

  • Retrieve all the userBs that visited the profile of userA, without repeating any userB, sorted by time.

query(hash_key=userA, range_key_condition=BEGINS_WITH(common_prefix))

where common_prefix = 2013-01 for all interactions in Jan 2013

This retrieves all records in the time range, sorted by range key; because the key starts with an ISO 8601 timestamp, that order is chronological (provided all timestamps use the same format and time zone). In application code you then filter the results, either to keep only the records concerning a given userB or to keep a single record per userB. Unfortunately, the DynamoDB API doesn't accept a list of range key conditions (otherwise you could save some time by passing an additional CONTAINS userB condition).
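The application-side filtering could look roughly like the sketch below. It assumes the range keys come back in ascending order and use a `#` separator between timestamp, user and uuid; the sample keys and the helper name `latest_visit_per_user` are made up for the example.

```python
SEP = "#"  # assumed separator between timestamp, user id and uuid


def latest_visit_per_user(range_keys):
    """Keep one entry per user (the most recent visit), newest first.

    Expects range keys sorted ascending, as a query on a time-prefixed key returns them.
    """
    latest = {}
    for key in range_keys:
        stamp, user, _unique = key.split(SEP)
        latest[user] = stamp  # later (newer) keys overwrite earlier ones
    return sorted(latest.items(), key=lambda item: item[1], reverse=True)


# Hypothetical query result for hash_key=userA and prefix "2013-01".
keys = [
    "2013-01-02T08:00:00Z#userB#a1",
    "2013-01-05T09:30:00Z#userC#b2",
    "2013-01-09T17:45:00Z#userB#c3",
]
visitors = latest_visit_per_user(keys)
only_user_b = [k for k in keys if k.split(SEP)[1] == "userB"]
```

Here `visitors` holds one (user, timestamp) pair per visitor, and `only_user_b` keeps every visit by a single user.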

  • Retrieve a particular pair visit of UserA and UserB

query(hash_key=userA, range_key_condition=BEGINS_WITH(common_prefix))

where common_prefix could be much more precise if we can assume you know the timestamp of the interaction.

Of course, this design should be evaluated with respect to the properties of the data stream you will handle. If you can (most often) specify a meaningful time range for your queries, it will be fast and bounded by the number of interactions recorded for userA in that time range.

If your application is not so time-oriented, and we can assume a user most often has only a few interactions, you might switch to the following schema:

hash_key = userA
range_key = userB+iso8601_timestamp+uuid

This way you can query by user:

query(hash_key=userA, range_key_condition=BEGINS_WITH(userB))

This alternative will be fast and bounded by the number of userA-userB interactions over all time ranges, which could be meaningful depending on your application.
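To see why a prefix condition on a user-first key only touches that user's records, here is a small in-memory stand-in in Python. It is not the DynamoDB API, just a sorted list with a hypothetical `begins_with` helper, and it assumes a `#` separator after the user id.

```python
from bisect import bisect_left

SEP = "#"  # assumed separator; must never appear inside a user id


def begins_with(sorted_keys, prefix):
    """Return the contiguous run of keys that start with `prefix`."""
    start = bisect_left(sorted_keys, prefix)  # first key >= prefix
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


# Hypothetical range keys stored under hash_key=userA, kept in sorted order.
keys = sorted([
    "userB#2013-01-02T08:00:00Z#a1",
    "userC#2013-01-05T09:30:00Z#b2",
    "userB#2013-01-09T17:45:00Z#c3",
    "userBob#2013-01-03T12:00:00Z#d4",
])
hits = begins_with(keys, "userB" + SEP)
loose_hits = begins_with(keys, "userB")
```

Including the separator in the prefix matters: the bare prefix `userB` would also match `userBob`, while `userB#` matches only userB, and the matches stay in chronological order.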

So basically you should check example data and estimate which orientation is meaningful for your application. Both orientations (time or user) might also be sped up by manually creating and maintaining indexes in other tables, at the cost of more complex application code.


(historical version: trick to avoid overwriting records with time-based keys) A common trick in your case is to suffix the range key with a generated unique id (uuid). This way you can still issue query calls with a BETWEEN condition to retrieve records that were inserted in a given time period, and you don't need to worry about key collisions at insertion time.
