Reputation: 27594
I have a set A
comprised of sets, and need to determine whether each set in A equals or is a subset of any other set in A
.
My instinct is to proceed as follows: Take each set i
in A
, sort then join its values with some reserved character, and determine whether that composite key exists in a hashmap. If not, add the composite key to the hashmap. Then for each combination of members in i
, also sort and join those values into a composite key and insert the key into the hashmap. Then proceed to the next set in A
.
The trouble with this approach is the space requirements are huge, as I have ~.25 million sets in A
and some have many members. I wanted to accomplish the above in main memory but can't in 16GB of RAM.
Is there a way to approach this task that's more space efficient? I'd be very grateful for any insight others can offer on this question.
Upvotes: 0
Views: 63
Reputation: 3745
Depending on how many distinct elements you have, an inverted index might make sense.
The basic idea is that for each element e
you build a list of set IDs of sets that contain e
. Then for each set i you intersect the lists (this can be optimized e.g., by sorting the set Ids) of all elements in i to get all sets where all elements of i are contained.
Example:
set 1: A, C
set 2: B, C, E
set 3: A, C, E
Inverted Index:
A -> 1, 3
B -> 2
C -> 1, 2, 3
E -> 2, 3
Then for set 1 you query the inverted index for A & C
and intersect the lists which yields 1 and 3
after removing set 1 (as a self-hit) you end up with set 3 which contains set 1. Proceed with the other sets.
Libraries like Apache Lucene or Elastic Search support this idea efficiently. There might also be in-memory inverted indices that can do this in RAM.
Upvotes: 1