Reputation: 874
To set this up: at my company we have users and a set of tags that describe these users. Each user can have up to 5,000 tags attached.
We have an engine that allows clients to pick certain tags to make a tag group. The engine has AND/OR functionality and Include/Exclude. Clients can create a tag group and our engine finds the total number of users that meet the logical requirements specified in the tag group. Basically this is just intersections, unions, and excludes, so Redis sets have been perfect.
To handle this, I store the data like so: Tag1: [user1, user2, user3], Tag2: [user1, user5, user6], etc.
From here, all of the boolean logic is done using scripts.
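For readers unfamiliar with the setup, the boolean logic described above maps directly onto Redis's SINTER (AND), SUNION (OR), and SDIFF (exclude) commands. Here is a minimal pure-Python sketch of the same semantics; the tag names and user IDs are made up for illustration:

```python
# Pure-Python model of the set operations the engine runs in Redis.
# In Redis: SINTER -> AND, SUNION -> OR, SDIFF -> exclude.
tags = {
    "tag1": {"user1", "user2", "user3"},
    "tag2": {"user1", "user5", "user6"},
    "tag3": {"user2", "user3"},
}

# AND: users having both tag1 and tag2   (SINTER tag1 tag2)
both = tags["tag1"] & tags["tag2"]

# OR: users having tag1 or tag3          (SUNION tag1 tag3)
either = tags["tag1"] | tags["tag3"]

# Exclude: tag1 users without tag3       (SDIFF tag1 tag3)
only = tags["tag1"] - tags["tag3"]

print(sorted(both))    # ['user1']
print(sorted(either))  # ['user1', 'user2', 'user3']
print(sorted(only))    # ['user1']
```

A tag group is then just a composition of these three operations over the chosen tag sets.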
However, our customer base is expanding rapidly. Within a couple of years, we will either need several 64GB Redis servers or an alternative.
Here is my question: are there any lightning-fast, disk-based DB options for doing intersect and union? I have tried Postgres, but the performance is unacceptable. For example, in Redis a compare on a 500k-user set takes about 1 second; in Postgres I was seeing around 30 seconds, more if there are lots of tags in the tag group.
I have had DynamoDB recommended and a few others but just wanted some educated opinions before I dig too deep.
Thanks, Dan
Upvotes: 1
Views: 2585
Reputation: 748
Sorry to comment on such an old question.
I'm sure it would not be as fast as Redis, but I wanted to mention two Postgres features that are in the ballpark of 'tags' and 'groups of tags':
Ltree is a handy syntax for creating a hierarchy of categories (it supports full text search): http://www.postgresql.org/docs/9.1/static/ltree.html
and (I have not used this) hstore is a key/value type that can serve as a tag implementation: http://www.postgresql.org/docs/9.0/static/hstore.html
I believe that if you are clever about how you use these tools (and build the correct indexes) you should be able to get query times down to a reasonable value.
Upvotes: 0
Reputation: 73306
"Lightning fast DB" and "disk based" are not really compatible. The fastest stores are in-memory stores.
On top of using intset, another possible optimization is to represent the sets as bitmaps. It all depends on the cardinality of the data, but supposing the number of users will grow faster than the number of tags, it may be interesting to have one bitmap per tag. In the bitmap, a given bit is indexed by the numeric ID of a user.
Redis 2.6 supports SETBIT, BITOP and BITCOUNT operations precisely for this purpose.
With one bit per user, 500K users take less than 64 KB per tag (500,000 bits ≈ 61 KB), to be multiplied by the global number of tags. I suspect you will find it is even more compact than using intset.
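As a sketch of the bitmap idea, here is a pure-Python version in which Python integers stand in for Redis bitmaps; the user IDs are hypothetical:

```python
# Each tag is a bitmap; bit i is set when the user with numeric ID i has the tag.
# In Redis: SETBIT tag:1 <user_id> 1, then BITOP AND/OR and BITCOUNT.
def set_bit(bitmap: int, user_id: int) -> int:
    """Set the bit for user_id (SETBIT <tag> <user_id> 1)."""
    return bitmap | (1 << user_id)

tag1 = 0
tag2 = 0
for uid in (1, 2, 3):
    tag1 = set_bit(tag1, uid)
for uid in (1, 5, 6):
    tag2 = set_bit(tag2, uid)

both = tag1 & tag2             # BITOP AND dest tag1 tag2
either = tag1 | tag2           # BITOP OR dest tag1 tag2
count = bin(both).count("1")   # BITCOUNT dest

print(count)  # 1 -> only user 1 has both tags
```

At 1 bit per user, a tag covering 500,000 users costs 500,000 / 8 bytes ≈ 61 KB, which is where the "less than 64K" figure above comes from.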
Upvotes: 3
Reputation: 31528
Redis is the best way to get fast intersections and unions. You can do a few things to limit the memory used by Redis:
Internally, Redis uses a data structure called an intset: a sorted array of integers. Finding an integer in this set is O(log N). An intset comes in three flavours - 16 bit, 32 bit and 64 bit.
From a memory perspective, intsets are very efficient. If you are using sets and care about memory, you should make sure you are using intsets.
To take advantage of intsets, you need to do two things:

1. Set `set-max-intset-entries` to a reasonable number. This would be the maximum number of users for a given tag. Note that increasing it beyond a point can actually degrade performance.
2. The sets only need user IDs; they don't need the entire user object. So, if memory becomes a constraint, you can also move the user objects to another data store - perhaps another Redis server, or even a relational database. This approach gives you the best of both worlds.
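To illustrate the intset lookup behaviour, here is a rough pure-Python sketch; Redis's C implementation differs in detail, but the O(log N) binary search over a sorted array is the same idea:

```python
import bisect

def intset_contains(intset: list[int], value: int) -> bool:
    """Membership test on a sorted array of ints, O(log N) like a Redis intset."""
    i = bisect.bisect_left(intset, value)
    return i < len(intset) and intset[i] == value

user_ids = [100, 205, 307, 451, 999]   # kept sorted, like an intset
print(intset_contains(user_ids, 307))  # True
print(intset_contains(user_ids, 308))  # False
```

For reference, `set-max-intset-entries` defaults to 512 in redis.conf, and `OBJECT ENCODING <key>` tells you whether a given set is currently stored as an `intset` or has been converted to a regular hash-table set.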
Upvotes: 6