rishiehari

Reputation: 394

Spark flatMapToPair vs [filter + mapToPair]

What is the performance difference between the blocks of code below?

1. flatMapToPair: This code block uses a single transformation, but it embeds the filter condition inside it: returning an empty list effectively prevents the element from progressing into the resulting RDD.

rdd.flatMapToPair(element -> {
    if ( <condition> )
        return Lists.newArrayList();  // empty list: drop this element

    return Lists.newArrayList(new Tuple2<>(key, element));
})

2. [filter + mapToPair]: This code block has two transformations: the first simply filters using the same condition as the block above, and the second, mapToPair, maps each remaining element to a pair.

rdd.filter(
    (element) -> <condition>
).mapToPair(
    (element) -> new Tuple2<>(key, element)
)

Is Spark intelligent enough to perform the same with both of these blocks of code regardless of the number of transformations, or will it perform worse with code block 2 because it uses two transformations?

Thanks

Upvotes: 0

Views: 3139

Answers (1)

zero323

Reputation: 330113

Actually, Spark will perform worse in the first case, because it has to initialize and then garbage collect a new ArrayList for each record. Over a large number of records this can add substantial overhead.
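
If you want to keep the single flatMapToPair anyway, one way to cut down on those allocations is to reuse the shared immutable empty list. A minimal sketch, assuming the Spark 1.x Java API (where flatMapToPair returns an Iterable; on 2.x you would return an Iterator instead), with a toy even-number predicate standing in for your condition:

import java.util.Arrays;
import java.util.Collections;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext("local[*]", "flatmap-allocations");
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

JavaPairRDD<Integer, Integer> pairs = rdd.flatMapToPair(element -> {
    if (element % 2 != 0) {  // placeholder condition
        // Collections.emptyList() returns a shared immutable singleton,
        // so dropped records allocate nothing.
        return Collections.<Tuple2<Integer, Integer>>emptyList();
    }
    // One small immutable wrapper instead of a full ArrayList.
    return Collections.singletonList(new Tuple2<>(element, element));
});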

Otherwise, Spark is "intelligent enough" here: it uses lazy data structures and combines multiple transformations which don't require shuffles into a single stage.
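
You can see this for yourself by inspecting the lineage. A minimal sketch (assuming a local master; the even-number predicate again stands in for the real condition): toDebugString() shows no shuffle boundary for the filter + mapToPair version, so both transformations run pipelined in one stage.

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext("local[*]", "stage-check");
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

JavaPairRDD<Integer, Integer> pairs = rdd
    .filter(element -> element % 2 == 0)  // placeholder condition
    .mapToPair(element -> new Tuple2<>(element, element));

// The lineage contains no ShuffledRDD entry, so filter and
// mapToPair are pipelined and executed in a single stage.
System.out.println(pairs.toDebugString());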

There are some situations where explicitly merging different transformations is beneficial (either to reduce the number of initialized objects or to keep the lineage shorter), but this is not one of them.
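
For completeness, a typical form of such explicit merging is mapPartitionsToPair, which filters and maps in one pass and allocates one output buffer per partition rather than per record. A minimal sketch under the same assumptions as above (placeholder condition; on Spark 2.x you would return out.iterator() instead of the list):

import java.util.ArrayList;
import java.util.List;

import scala.Tuple2;

rdd.mapPartitionsToPair(partition -> {
    List<Tuple2<Integer, Integer>> out = new ArrayList<>();
    while (partition.hasNext()) {
        Integer element = partition.next();
        if (element % 2 == 0) {  // placeholder condition
            out.add(new Tuple2<>(element, element));
        }
    }
    // One list per partition instead of one per record.
    return out;
});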

Upvotes: 3
