Reputation: 1050
I have a large list in a JavaPairRDD<Integer, List<String>> and I want to do a flatMap to get all possible combinations of list entries, so that I end up with a JavaPairRDD<Integer, Tuple2<String, String>>. Basically, if I have something like
(1, ["A", "B", "C"])
I want to get:
(1, <"A","B">)
(1, <"A", "C">)
(1, <"B", "C">)
The problem arises with large lists: I build a large list of Tuple2 objects with a nested loop over the input list, and sometimes this list does not fit in memory. I found this, but I am not sure how to implement it in Java: Spark FlatMap function for huge lists
Upvotes: 1
Views: 1011
Reputation: 1214
It depends on how big your datasets are. In my job I usually process 100-200 GB datasets, and both flatMap and flatMapToPair work properly for computation-intensive workloads. Example below:
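// "result" is a placeholder name; DatasetsRDD is the input RDD.
JavaPairRDD<Integer, List<String>> result = DatasetsRDD
    .flatMapToPair(x -> {
        // xx is a placeholder for the (key, value) tuples produced from record x
        return xx;
    });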
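One way to avoid building the whole list of pairs inside the flatMapToPair lambda is to hand back an iterator that produces each pair only when it is requested. The following is a minimal sketch of such an iterator in plain Java, not a definitive implementation: the class name LazyPairs is hypothetical, pairs are represented as String[] so the sketch runs without Spark, and the input list is assumed to support fast index access (for example an ArrayList).

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper: lazily enumerates all pairs (items[i], items[j]) with i < j. */
final class LazyPairs implements Iterator<String[]> {
    private final List<String> items;
    private int i = 0; // index of the first element of the next pair
    private int j = 1; // index of the second element of the next pair

    LazyPairs(List<String> items) {
        this.items = items;
    }

    @Override
    public boolean hasNext() {
        // A pair remains as long as the first index still has an element after it.
        return i < items.size() - 1;
    }

    @Override
    public String[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String[] pair = { items.get(i), items.get(j) };
        // Advance the second index; when it runs off the end, move the first index forward.
        j++;
        if (j >= items.size()) {
            i++;
            j = i + 1;
        }
        return pair;
    }
}
```

Only one pair is held in memory at a time, regardless of the list length. In a Spark 2.x or later job, where the flatMapToPair function returns an Iterator, the lambda could wrap each String[] from this iterator into a Tuple2 of the record key and a Tuple2 of the two strings; that wiring is an assumption and is not shown here.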
Also, if your datasets are huge, you could try using Spark persistence to disk.
Storage Level
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
References: https://spark.apache.org/docs/latest/rdd-programming-guide.html
Upvotes: 1
Reputation: 53809
You may want to flatMap the list and then join the RDD on itself before filtering equal values:
JavaPairRDD<Integer, List<String>> original = // ...
JavaPairRDD<Integer, String> flattened = original.flatMapValues(identity());
JavaPairRDD<Integer, Tuple2<String, String>> joined = flattened.join(flattened);
JavaPairRDD<Integer, Tuple2<String, String>> filtered =
    joined.filter(new Function<Tuple2<Integer, Tuple2<String, String>>, Boolean>() {
        @Override
        public Boolean call(Tuple2<Integer, Tuple2<String, String>> kv) throws Exception {
            // Keep only pairs whose two elements differ; the self-join also yields (x, x).
            return !kv._2()._1().equals(kv._2()._2());
        }
    });
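Note that a self-join yields every ordered pair within a key, so both ("A", "B") and ("B", "A") appear alongside ("A", "A"). If each combination should appear only once, as in the question's expected output, comparing the two elements with compareTo is one option. The sketch below illustrates the effect of that condition on a small local list; it uses plain Java collections in place of the RDD (an assumption so that it runs without Spark), the class and method names are hypothetical, and it assumes the list entries are distinct.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical local stand-in for the join-then-filter step on a single key. */
final class SelfJoinFilterSketch {
    /** Walks the cross product a self-join would yield and keeps pairs with a < b. */
    static List<String> orderedPairs(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String a : values) {
            for (String b : values) {
                // a.compareTo(b) < 0 drops (x, x) and keeps one of (x, y) / (y, x).
                if (a.compareTo(b) < 0) {
                    kept.add(a + "," + b);
                }
            }
        }
        return kept;
    }
}
```

For the input ["A", "B", "C"] this keeps "A,B", "A,C" and "B,C". In the Spark filter, the equivalent condition would be kv._2()._1().compareTo(kv._2()._2()) < 0.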
Upvotes: 2