lord.garbage
lord.garbage

Reputation: 5960

Efficient algorithm to check for subset against a range of sets

I read through some of the posts about determining whether a set A is a subset of another set B. But I find it hard to determine what algorithm to use. Here is an outline of the problem:

Now, I thought about hashtables. However, they would, in my opinion only be efficient if the was only one B and a lot of As. Then I could make a hashtable for B and check each string array of each object against my hashtable. But this is not the case because there is only one A but n Bs. What would be an efficient algorithm to do this?

Example:

A:  ["A", "G", "T"]
B1: ["C", "G"]
B2: ["K", "A", "U", "T", "G"]
.
.
.
Bn: ["T", "I", "G", "O", "L"]

Here A is a subset of B2 but not of B1, and not of Bn.

Upvotes: 5

Views: 2198

Answers (3)

kfx
kfx

Reputation: 8537

As you know A beforehand, you can design a collision-free hash function to hash all elements of A.

Then operate only on the hashes in the search step, not the strings. For each element of B, compute its hash and then use it to look up an element of A. If an element is found, it means that the hashes match; then you'll also need to compare the strings to detect if its a true positive or just an accidental match.

Count the number of matches. When that number is equal to the size of A, stop and return a positive result. If all elements of a B have been processed and the number of matches is less than size of A, return a negative result.

Upvotes: 1

user1196549
user1196549

Reputation:

An efficient approach is to represent the set A as a trie. This allows to check if a given string belongs to A in time linear in the string length.

Then there is no better way then checking exhaustively for all Bi and all strings in Bi if it belongs to A. The search stops as soon as all strings in A have been matched (flagging a string when it has been found).

The running time will be at worse proportional to the total number of characters in all strings in all B. In practice, a significant fraction of the characters will be skipped, as

  • the search for a string not in A can terminate early,

  • the subset test can conclude positively even if there are strings left in Bi,

  • the subset test can conclude negatively when there are more unmatched strings in A than strings left in Bi.

This approach is certainly worst-case optimal as you read the characters at most once and perform a constant number of operations per character.

Upvotes: 2

CiaPan
CiaPan

Reputation: 9570

As a first approach I would pre-calculate some general properties of sets, which (hopefully) will allow you to filter some of Bs quickly. These may be, for example:

  • a number of strings – A certainly can't be a subset of B if it contains more elements than B;
  • a length of the longest string – A certainly is not a subset of B if the longest string in A is longer than the longest string in B;
  • a sum of strings' lengths.

For easier checking you might require every set to be ordered alphabetically. That will allow checking A against a single B in a (linear) scan through both sets of strings.

For small A and big B sets it might be more efficient to look for a string in B with binary search rather than with a linear scan; that would require pre-sorting of B, too.

Upvotes: 1

Related Questions