Guigui

Reputation: 1115

Datamodeling for Aerospike

I am investigating Aerospike. We need to use it as a cache (no need for persistence), as the data only lives for a very short period of time: we create it, we read it, and then the goal is to delete it as fast as possible based on some processing in a service.

Our data looks something like this:

Record:
- RecordId
- ClientId
- Partition
- Region
- Size
- May have X number of custom attributes (I will probably limit the number of the attributes)

ClientId here represents the multitenancy we want to implement. We will only ever query records that belong to one specific ClientId.

We need to query this data on different fields. I know that this is not easy in Aerospike, as it only supports one secondary-index filter per query. Since we need to support a large number of records (probably in the range of several million), we want to partition our records based on their Partition field. That should allow queries to run faster and make post-processing easier.

Each record would have the same format within a Partition, but it may differ from one partition to another.

To solve this problem, I want to model my data in Aerospike like this:

Sets:

Partition_{ClientId} (string equality filter)
   Key: RecordId
   Bin: Partition
   Index: Partition

Region_{ClientId} (string equality filter)
   Key: RecordId
   Bin: Region
   Index: Region

Size_{ClientId} (integer range search)
   Key: RecordId
   Bin: Size
   Index: Size

With as many sets as necessary to filter my data. We would then query the different sets and intersect the results of the queries to get the filtered records.
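Under this model the intersection happens client-side. A minimal pure-Python sketch of that step, assuming each single-filter query returns the matching RecordIds (the record IDs and values below are hypothetical):

```python
def intersect_results(*result_sets):
    """Intersect the RecordId sets returned by each single-filter query."""
    if not result_sets:
        return set()
    ids = set(result_sets[0])
    for other in result_sets[1:]:
        ids &= set(other)
    return ids

# Hypothetical results of three separate secondary-index queries:
partition_hits = {"r1", "r2", "r3", "r4"}   # Partition == "P7"
region_hits    = {"r2", "r3", "r5"}         # Region == "North"
size_hits      = {"r3", "r4", "r5"}         # Size > 300

matching = intersect_results(partition_hits, region_hits, size_hits)
print(sorted(matching))  # ['r3']
```

Note that each query still has to materialize its full result set before the intersection, which is part of what makes this model expensive.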

First question: I am doing this because, from what I read, there is no easy way to filter a set based on several filters. Is this a correct assumption?

Second question: with that model we would reach the limit on the number of sets in one namespace much faster. Is there any other way to model this sort of data while still being efficient?

Upvotes: 0

Views: 115

Answers (1)

pgupta

Reputation: 5415

You can have a maximum of 1023 sets and define a maximum of 256 secondary indexes. If the number of partitions is limited (under 1023), use the partition as the secondary index. SIs are built in process RAM and give you the advantage of a faster first grouping of eligible records for your query. Then filter using Expressions on ClientID and whatever other conditions you need.

Records also have metadata: the expiration time (TTL) in your case, or the LastUpdateTime of the record (or neither). If you can first filter on metadata that gives a definitive GO/NOGO, that is fast, because metadata is in RAM (assuming Community Edition), and it saves reading the record from disk for the other bin-value filtering. Bin data is on disk, assuming you are using storage-engine device. If this is a cache and you are using storage-engine memory, then bin data retrieval will also be faster.

So, you can execute queries like this: for PartitionId == 220, give me all records for ClientID == 3005 where the remaining life (TTL) is greater than 3600 seconds, Region == "North", and Size > 300. That is, you can build any combination of logic that evaluates to true or false on the record metadata and/or the bin values, or on bin values only. For this example query, you only need an SI on PartitionId.
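The boolean logic such an expression encodes can be sketched in plain Python (this is only an illustration of the filter, not the Aerospike Expressions API; the record shape is hypothetical). The SI on PartitionId does the first grouping server-side, so the per-record check only covers the remaining conditions, with the in-RAM metadata checked before the on-disk bins:

```python
def record_matches(meta, bins):
    """GO/NOGO check for one record already selected by the SI on PartitionId.

    Metadata (in RAM) is evaluated first; bin values (on disk with
    storage-engine device) are only read if the metadata check passes.
    """
    # Metadata filter: remaining life (TTL) must exceed 3600 seconds.
    if meta["ttl"] <= 3600:
        return False
    # Bin-value filters, mirroring the example query in the answer.
    return (
        bins["ClientID"] == 3005
        and bins["Region"] == "North"
        and bins["Size"] > 300
    )

print(record_matches({"ttl": 7200},
                     {"ClientID": 3005, "Region": "North", "Size": 512}))  # True
print(record_matches({"ttl": 100},
                     {"ClientID": 3005, "Region": "North", "Size": 512}))  # False
```

In a real client you would express the same logic as a filter expression attached to the query policy, so the evaluation happens server-side per record instead of in your application.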

Upvotes: 2
