Reputation: 1227
The problem is: suppose we have a group of sets:
Set(1,2,3)
Set(1,2,3,4)
Set(4,5,6)
Set(1,2,3,4,6)
We need to delete all the subsets and finally get the result:
Set(4,5,6)
Set(1,2,3,4,6)
(Since both Set(1,2,3) and Set(1,2,3,4) are subsets of Set(1,2,3,4,6), both are removed.)
And suppose that the elements of the sets have an order; they can be Int, Char, etc.
Is it possible to do it in a map-reduce way?
The reason to do it in a map-reduce way is that sometimes the group of sets is so large that it cannot be processed in the memory of a single machine. So we hope to do it in a map-reduce way; it may not be very efficient, but it just needs to work.
My problems are:
I don't know how to define a key for the key-value pairs in the map-reduce process so that the sets are grouped properly.
I don't know when the process should finish, i.e. when all the subsets have been removed.
EDIT:
The size of the data will keep growing larger in the future.
The input can be either a group of sets or multiple lines, each line containing a group of sets. Currently the input is val data: RDD[Set]; I first do data.collect(), which results in one overall group of sets. But I can modify the generation of the input into an RDD[Array[Set]], which will give me multiple lines, each containing a group of sets.
The elements in each set can be sorted by modifying other parts of the program.
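For reference, a minimal sketch of the current collect-based approach (assuming the elements are Int, so data is an RDD[Set[Int]]); this is the part that stops working once the group of sets no longer fits in the driver's memory:

// collect everything to the driver, then keep only the sets that are not
// properly contained in any other set (single-machine, not scalable)
val groups: Array[Set[Int]] = data.collect()
val maximal = groups.filter(s => !groups.exists(t => t != s && s.subsetOf(t)))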
Upvotes: 6
Views: 635
Reputation: 3079
If I understand correctly:
your goal is to detect and remove subsets within a large collection of sets
there are too many sets to manage all at once (memory limit)
the strategy is map and reduce (or something of that sort)
What I take into account:
the main problem is that you cannot manage everything at the same time
the usual map/reduce method supposes splitting the data and treating each part independently. Here it cannot be done exactly like that, because any set can intersect with any other set.
If I make some calculations: even with 100,000 sets (about 10 billion pairwise comparisons), it takes too much time (I stopped).
What I propose (tested with 100,000 sets):
1. Define a criterion to split the collection into smaller, compatible packages. Compatible packages are groups of sets chosen so that a subset and its superset are guaranteed to end up in at least one common package: then you are sure to find the subsets to remove with this method. Said differently: if set A is a subset of set B, then A and B will reside together in one (or several) such packages.
I simply take every set that contains one given element (1, 2, 3, ...) => with the previous assumptions this gives approximately 11,500 sets per package. It becomes reasonable to compare them (120,000 comparisons). It takes 180 seconds on my machine, and it finds 900 subsets to remove.
You have to do this 100 times (so 18,000 seconds).
And of course you can find duplicated work across packages (but not too much: a few percent, and the goal is to eliminate it).
2. At the end it is easy and fast to agglomerate the results. The duplicated work is light.
3. Bigger filters: with a filter on 2 elements, you reduce each package to about 1,475 sets => you get approximately 30 sets to delete per package, and it takes 2-3 seconds. But you have to do that 10,000 times.
Interest of this method:
the selection of sets by the criterion is linear and very simple. It is also hierarchical: split on one element, then on a second, etc.
it is stateless: you can filter millions of sets, you only have to keep the good ones. The more data you have, the more filters you have to run => the solution is scalable.
if you want to treat smaller groups of sets, you can take 3, 4 elements in common, etc.
like that, you can spread your treatment among multiple machines (as many as you have).
At the end, you have to reconcile all your data/deletions.
This solution does not save a lot of time overall (you can do the calculations), but it suits the need of splitting the work; a rough sketch of the splitting step follows.
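A small Scala sketch of this splitting idea (illustrative only: it assumes Int elements, assumes each package fits in memory, and the function names are made up):

// remove, within one package, every set properly contained in another set of that package
def removeLocalSubsets(pkg: Seq[Set[Int]]): Seq[Set[Int]] =
  pkg.filter(s => !pkg.exists(t => t != s && s.subsetOf(t)))

def maximalSets(all: Seq[Set[Int]]): Seq[Set[Int]] = {
  // one package per element value; a set lands in every package keyed by its elements,
  // so a subset always shares at least one package with any of its supersets
  val packages: Map[Int, Seq[Set[Int]]] =
    all.flatMap(s => s.map(e => (e, s))).groupBy(_._1).map { case (e, v) => e -> v.map(_._2) }
  // filter each package independently (this is the part that can be distributed)
  val survivors: Map[Int, Set[Set[Int]]] =
    packages.map { case (e, pkg) => e -> removeLocalSubsets(pkg).toSet }
  // reconcile: keep a set only if no package it belongs to eliminated it (empty sets assumed absent)
  all.distinct.filter(s => s.forall(e => survivors(e).contains(s)))
}

In a map-reduce reading, emitting the (element, set) pairs is the map phase, each package can be handled by a separate reducer/machine, and only the final reconciliation needs a global view.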
Hope it helps.
Upvotes: 0
Reputation: 7462
Since subset-of is a transitive relation (proof), you could take advantage of that and design an iterative algorithm that eliminates subsets in each iteration.
The logic is the following:
Mapper:
eliminate local subsets and emit only the supersets. Let the key be the first element of each superset.
Reducer:
eliminate local subsets and emit only the supersets.
You could also use a combiner with the same logic.
Each time, the number of reducers should decrease, until, in the last iteration, a single reducer is used. This way, you can define the number of iterations from the beginning. E.g., by setting 8 reducers initially and using half of them in each subsequent iteration, your program will terminate after 4 iterations (8 reducers, then 4, then 2, then 1). In general, it will terminate in log n + 1 iterations (log base 2), where n is the initial number of reducers, so n should be a power of 2 and of course less than the number of mappers. If this feels restrictive, you can think of more drastic decreases in the number of reducers (e.g., keep only 1/4 of them each time, or fewer).
Regarding the choice of the key, this can create balancing issues, if, for example, most of the sets start with the same element. So, perhaps you could also make use of other keys, or define a partitioner to better balance the load. This policy makes sure, though, that sets that are equal will be eliminated as early as possible.
If you have MapReduce v.2, you could implement the aforementioned logic as follows (pseudocode):
Mapper:
Set<Set<Integer>> superSets;

setup(context) {
    superSets = new HashSet<>();
}

map(key, inputSet, context) {
    Set<Set<Integer>> subsumed = new HashSet<>();
    for (Set<Integer> superSet : superSets) {
        // inputSet is contained in an already retained set: drop it
        if (superSet.containsAll(inputSet)) {
            return;
        }
        // inputSet contains a retained set: mark that set for removal
        if (inputSet.containsAll(superSet)) {
            subsumed.add(superSet);
        }
    }
    superSets.removeAll(subsumed);
    superSets.add(inputSet);
}

cleanup(context) {
    // emit the locally maximal sets, keyed by their first element
    for (Set<Integer> superSet : superSets) {
        context.write(superSet.iterator().next(), superSet);
    }
}
You can use the same code in the reducer and in the combiner.
As a final note, I doubt that MapReduce is the right environment for this kind of computation. Perhaps Apache Spark or Apache Flink offer better alternatives.
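For instance, a rough Spark sketch of the same iterative idea (illustrative only; removeLocalSubsets is a made-up name and the element type Int is assumed): eliminate subsets inside each partition, then halve the number of partitions and repeat until a single partition remains.

import org.apache.spark.rdd.RDD

// keep only the sets of one partition that are not contained in another set of that partition
def removeLocalSubsets(sets: Iterator[Set[Int]]): Iterator[Set[Int]] = {
  val kept = scala.collection.mutable.ListBuffer.empty[Set[Int]]
  for (s <- sets) {
    if (!kept.exists(t => s.subsetOf(t))) {     // s is contained in (or equals) a kept set: drop it
      kept --= kept.filter(t => t.subsetOf(s))  // s contains some kept sets: evict them
      kept += s
    }
  }
  kept.iterator
}

// data: RDD[Set[Int]] as in the question, e.g.
// val data: RDD[Set[Int]] = sc.parallelize(Seq(Set(1,2,3), Set(1,2,3,4), Set(4,5,6), Set(1,2,3,4,6)))
var current: RDD[Set[Int]] = data
var parts = current.getNumPartitions
while (parts > 1) {
  current = current.mapPartitions(removeLocalSubsets).coalesce(parts / 2)
  parts = current.getNumPartitions
}
val result = current.mapPartitions(removeLocalSubsets)  // the globally maximal sets

Each round plays the role of one MapReduce iteration above: local elimination, then fewer partitions, until one partition gives the final answer.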
Upvotes: 0
Reputation: 6771
I doubt this can be done by a traditional map-reduce technique which is essentially a divide-and-conquer method. This is because:
if A is not a subset of B, and B is not a subset of C, we cannot make any statement about A w.r.t. C. For example, with A = Set(1), B = Set(2), C = Set(1,3), neither premise is a subset relation, and yet A is a subset of C.
Based on the above observation, this problem seems to be similar to duplicate detection, and there is research on duplicate detection, for example here. Similar techniques will work well for the current problem.
Upvotes: 1