Determine whether set is subset of any member in set of sets

Question

I have a set A comprised of sets, and need to determine whether each set in A equals or is a subset of any other set in A.

My instinct is to proceed as follows: Take each set i in A, sort then join its values with some reserved character, and determine whether that composite key exists in a hashmap. If not, add the composite key to the hashmap. Then for each combination of members in i, also sort and join those values into a composite key and insert the key into the hashmap. Then proceed to the next set in A.

The trouble with this approach is the space requirements are huge, as I have ~.25 million sets in A and some have many members. I wanted to accomplish the above in main memory but can't in 16GB of RAM.

Is there a way to approach this task that's more space efficient? I'd be very grateful for any insight others can offer on this question.

SaiBot · Accepted Answer

Depending on how many distinct elements you have, an inverted index might make sense.

The basic idea is that for each element e you build a list of set IDs of sets that contain e. Then for each set i you intersect the lists (this can be optimized e.g., by sorting the set Ids) of all elements in i to get all sets where all elements of i are contained.

Example:

set 1: A, C
set 2: B, C, E
set 3: A, C, E

Inverted Index:

A -> 1, 3
B -> 2
C -> 1, 2, 3
E -> 2, 3

Then for set 1 you query the inverted index for A & C and intersect the lists which yields 1 and 3 after removing set 1 (as a self-hit) you end up with set 3 which contains set 1. Proceed with the other sets.

Libraries like Apache Lucene or Elastic Search support this idea efficiently. There might also be in-memory inverted indices that can do this in RAM.

Determine whether set is subset of any member in set of sets

Answers (1)

Related Questions