Reputation: 27078
I have a use case where I need to store a large number of entries with a unique set size for each one. If we simplify this to contacts (which the problem very much is not), we have a problem like:
Given a user, know how many friends they have:
Joe - Mary, John, Bob, Tom
Mary - Carol, Suzy, Mike, Fred, Robert
Thus friends(Joe) = 4
- the only operation supported is addFriend(Joe, Sam). While Mary might be friends with Joe, there is no need to store any of that relationship information.
What I would rather not do is store all of the entries in each set, but a Bloom filter doesn't quite feel right either. Are there other alternatives?
Update: The challenge is that I've got 20M Joe/Mary/... keys in the top-level sets, with ~4M semi-distinct members in each set. A quick code example is below (Python for simplicity), but at scale with persistent storage the universe comes to an end.
from collections import defaultdict

class World:
    def __init__(self):
        self.friends = defaultdict(set)

    def addFriend(self, id, member):
        self.friends[id].add(member)

    def numFriends(self, id):
        # renamed from friends(): the method would otherwise be shadowed
        # by the self.friends attribute; also added the missing self.
        return len(self.friends[id])
Upvotes: 3
Views: 139
Reputation: 65458
Since you're considering a Bloom filter, it sounds as though approximate answers would be OK. Use a small-space cardinality estimator like HyperLogLog in place of the sets in self.friends.
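To make this concrete, here is a minimal, illustrative HyperLogLog sketch in pure Python (not production code; in practice you would use a tuned library or Redis's PFADD/PFCOUNT). The class names and `p=12` parameter are my choices, not from the question. Each user's friend set is replaced by 2**p one-byte registers, so memory per user is fixed (~4 KB here) no matter how many friends are added, and the standard error is roughly 1.04/sqrt(2**p), about 1.6% at p=12.

```python
import hashlib
import math
from collections import defaultdict

class HyperLogLog:
    """Toy HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                      # number of registers (4096 at p=12)
        self.registers = bytearray(self.m)
        # bias-correction constant, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)             # high p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # small-range correction (linear counting) when registers are sparse
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)

class World:
    def __init__(self):
        self.friends = defaultdict(HyperLogLog)

    def addFriend(self, id, member):
        self.friends[id].add(member)

    def numFriends(self, id):
        return self.friends[id].count()
```

Re-adding the same member is a no-op (the register maximum doesn't change), which matches the set semantics of addFriend while keeping per-user storage constant.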
Upvotes: 1