Large list FlatMap Java Spark

Question

I have a large list in a JavaPairRDD<Integer, List<String>> and I want to do a flatMap to get all possible combinations of list entries, so that I end up with a JavaPairRDD<Integer, Tuple2<String, String>>. Basically, if I have something like

(1, ["A", "B", "C"])

I want to get:

(1, <"A", "B">) (1, <"A", "C">) (1, <"B", "C">)

The problem arises with large lists: I build a large list of Tuple2 objects with a nested loop over the input list, and sometimes this list does not fit in memory. I found this, but I am not sure how to implement it in Java: Spark FlatMap function for huge lists

Jean Logeart · Accepted Answer

You may want to flatMap the list, join the resulting RDD with itself on the key, and then filter out the pairs whose two values are equal:

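import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;
// Types assumed from the example in the question: Integer keys, String list entries.
JavaPairRDD<Integer, List<String>> original = // ...
// identity() is a placeholder for a function that returns the list unchanged.
JavaPairRDD<Integer, String> flattened = original.flatMapValues(identity());
// Self-join on the key: yields every ordered pair of values that share a key.
JavaPairRDD<Integer, Tuple2<String, String>> joined = flattened.join(flattened);
JavaPairRDD<Integer, Tuple2<String, String>> filtered =
    joined.filter(new Function<Tuple2<Integer, Tuple2<String, String>>, Boolean>() {
        @Override
        public Boolean call(Tuple2<Integer, Tuple2<String, String>> kv) throws Exception {
            // Drop pairs of equal values. This keeps both <A, B> and <B, A>;
            // use compareTo(...) < 0 instead to keep a single ordering.
            return !kv._2()._1().equals(kv._2()._2());
        }
    });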
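
If the goal is mainly to avoid holding all pairs in memory at once, as the question describes, the pairs can also be generated lazily instead of being collected into a list first. The following is a minimal sketch in plain Java and is not part of the answer above: `LazyPairs` and `pairs` are hypothetical names, and `Map.Entry` stands in for Spark's `Tuple2` so that the example runs without Spark on the classpath.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical helper: lazily yields every unordered pair (i < j) of a list. */
final class LazyPairs {

    private LazyPairs() {}

    /**
     * Returns an iterator over all pairs (items[i], items[j]) with i < j.
     * Only two indices are kept as state, so no list of pairs is ever built.
     * Assumes the list offers fast random access (e.g. ArrayList).
     */
    static <T> Iterator<Map.Entry<T, T>> pairs(final List<T> items) {
        return new Iterator<Map.Entry<T, T>>() {
            private int i = 0; // index of the first element of the next pair
            private int j = 1; // index of the second element of the next pair

            @Override
            public boolean hasNext() {
                // A pair remains as long as i can still be paired with a later j.
                return i < items.size() - 1;
            }

            @Override
            public Map.Entry<T, T> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                Map.Entry<T, T> pair =
                    new SimpleImmutableEntry<>(items.get(i), items.get(j));
                j++;
                if (j >= items.size()) { // row exhausted: advance i, restart j after it
                    i++;
                    j = i + 1;
                }
                return pair;
            }
        };
    }

    public static void main(String[] args) {
        // Prints <A, B>, <A, C>, <B, C>, one pair per line.
        Iterator<Map.Entry<String, String>> it = pairs(Arrays.asList("A", "B", "C"));
        while (it.hasNext()) {
            Map.Entry<String, String> p = it.next();
            System.out.println("<" + p.getKey() + ", " + p.getValue() + ">");
        }
    }
}
```

In a Spark job, the function passed to flatMapValues could return such an iterator (or an Iterable wrapping it, depending on what the Spark version in use expects), with Tuple2 in place of Map.Entry, so that each pair is created only when Spark consumes it. The number of pairs is still quadratic in the list size; only the intermediate in-memory list is avoided.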
