Reputation: 332
I have an array of arrays. Example:
[
[a, b, c],
[b, d],
[a, c],
[b],
[a, b, c, d]
]
In practice, inner arrays will be relatively short, but the outer array will be a few thousand rows.
Now, I have a reference array. Example:
[a, b, c]
The challenge is to determine which rows of the first array are completely contained in the reference array. For this example, I would want rows 1, 3, and 4 returned to me after this check.
I can do this using a nested loop pretty easily, but I was wondering if there was a clever algorithm that would be more scalable.
Recap: I need the row numbers that are some subset of the reference array. Row elements are not necessarily unique. EDIT: There are about 200 unique elements that can be inside each array. The array of arrays will be constant, but I will need to perform this same search for many different reference arrays
Upvotes: 1
Views: 83
Reputation: 4408
If you need to do this operation several times for the same array of arrays, and the inner arrays besides being short have a non-extensive set of possible elements, and they never have repeat elements (thanks @Kunukn), you can create an alternate structure turning your inner arrays into numbers and doing a "XOR" and "OR" between the element to be found and the array. That way, you'll loop your array only once. What's more, you may sort your array of arrays by value, which might be useful for other purposes. Example: a is 1, b is 2, c is 4, d is 8. Your sample array of arrays become:
[
[a, b, c], // 1 + 2 + 4 == 7
[b, d], // 2 + 8 == 10
[a, c], // 1 + 4 == 5
[b], // 2 == 2
[a, b, c, d] // 1 + 2 + 4 + 8 == 15
]
Then, your reference array [a, b, c]
is calculated into 7, the same way as the 1st element in the above array of arrays. Now, to search the element. Assuming a 4-bit integer to represent them, you'd have:
7 XOR 7 => 1110 XOR 1110 == 0000; 0000 OR 1110 == 1110 => equal to 7, found
10 XOR 7 => 0101 XOR 1110 == 1011; 1011 OR 1110 == 1111 => not equal to 7, not found
5 XOR 7 => 1010 XOR 1110 == 0100; 0100 OR 1110 == 1110 => equal to 7, found
2 XOR 7 => 0100 XOR 1110 == 1010; 1010 OR 1110 == 1110 => equal to 7, found
15 XOR 7 => 1111 XOR 1110 == 0001; 0001 OR 1110 == 1111 => not equal to 7, not found
EDIT: even though you won't have an integer that uses 200 bits, you can still use a 200-bit structure and do the same operations, it will just be a bit more of work for you.
Upvotes: 0
Reputation: 2236
You could use subset matching. I assume your dataset are sorted.
Build the subset possibilities from the reference array. Best if the reference array is small, else the dataset to be build and compared would be to big.
E.g. the sorted subset matching of the reference array [a, b, c]
gives
[a], [b], [c], [a,b], [a,c], [b,c], [a,b,c]
From the dataset
[
[a, b, c],
[b, d],
[a, c],
[a, c],
[b],
[a, b, c, d]
]
Build a hash table of the data with their position in an array
hash table
"abc", [1]
"bd", [2]
"ac", [3,4]
"b", [5]
"abcd", [6]
reference array hash
hashReference = {"a","b","c","ab", "ac," "bc", "abc"}
Algorithm
foreach key in hashRefence
lookup the hash table
This returns [5], [3,4], [1]
Upvotes: 0
Reputation: 4661
If you store your reference array as a structure where you can check for membership of an element in O(1) time (e.g., using a hash table or just a 0/1 membership indicator array since the universe is small), then you can store your collection of arrays as a trie where each array is sorted, and traverse the trie finding paths to leaves that are completely contained in your reference array. This can reduce complexity because you'll be traversing simultaneously for a subcollection of arrays using single comparisons higher up in the trie when the arrays in your subcollection overlap, and you'll be skipping subsets of arrays simultaneously when they contain a common element that is not in your reference array.
Upvotes: 2