Reputation: 27078
I have a use case where I need to store a large number of entries with a unique set size for each one. If we simplify this to contacts (which the problem very much is not), we have a problem like:
Given a user, know how many friends they have:
Joe - Mary, John, Bob, Tom
Mary - Carol, Suzy, Mike, Fred, Robert
Thus friends(Joe) = 4
- the only operation supported is addFriend(Joe, Sam). While Mary might be friends with Joe, there is no need to store any of that relationship information.
What I would rather not do is store all of the entries in each set, but a Bloom filter doesn't quite feel right either. Are there other alternatives?
Update: The challenge is that I've got 20M Joe/Mary/... keys in the top-level sets, with ~4M semi-distinct members in each set. A quick code example is below (Python for simplicity), but at scale with persistent storage the universe comes to an end.
from collections import defaultdict

class World:
    def __init__(self):
        self.friends = defaultdict(set)

    def addFriend(self, id, member):
        self.friends[id].add(member)

    def numFriends(self, id):
        # renamed from friends(): the method would otherwise be shadowed
        # by the self.friends attribute; also added the missing self.
        return len(self.friends[id])
Upvotes: 3
Views: 139
Reputation: 65458
Since you're considering a Bloom filter, it sounds as though approximate answers would be OK. Use a small-space cardinality estimator like HyperLogLog in place of the sets in self.friends.
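To make this concrete, here is a minimal, illustrative HyperLogLog sketch in pure Python (not production code; in practice you would use a tuned library or Redis's PFADD/PFCOUNT). The class names and `p=12` parameter are my choices, not from the question. Each user's friend set is replaced by 2**p one-byte registers, so memory per user is fixed (~4 KB here) no matter how many friends are added, and the standard error is roughly 1.04/sqrt(2**p), about 1.6% at p=12.

```python
import hashlib
import math
from collections import defaultdict

class HyperLogLog:
    """Toy HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                      # number of registers (4096 at p=12)
        self.registers = bytearray(self.m)
        # bias-correction constant, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)             # high p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # small-range correction (linear counting) when registers are sparse
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)

class World:
    def __init__(self):
        self.friends = defaultdict(HyperLogLog)

    def addFriend(self, id, member):
        self.friends[id].add(member)

    def numFriends(self, id):
        return self.friends[id].count()
```

Re-adding the same member is a no-op (the register maximum doesn't change), which matches the set semantics of addFriend while keeping per-user storage constant.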
Upvotes: 1