Reputation: 115
Let's say we create an RDD from Alluxio memory:
rdd1 = sc.textFile("alluxio://.../file1.txt")
rdd2 = rdd1.map(...)
Does rdd2 reside in Alluxio or on Spark's heap?
Also, would an operation like the following (both pair RDDs on Alluxio)
pairRDD1.join(pairRDD2)
create a new RDD in Alluxio or on the Spark heap?
The reason for the second question is that I need to join two large RDDs, both on Alluxio. Would the join use Alluxio's memory, or would the RDDs get pulled into Spark memory for the join (and where would the resulting RDD reside)?
Upvotes: 0
Views: 372
Reputation: 231
Spark transformations are evaluated lazily. That means map()
will not be evaluated until a result is required, and it will not consume any Spark memory on its own. An RDD will only occupy Spark storage memory if you explicitly call cache()
on the RDD.
Therefore, when you are joining 2 RDDs from Alluxio, only the source data of the RDDs will be in memory, in Alluxio. During the join, Spark will use the memory required to execute the join.
Where the resulting RDD resides depends on what you do with it. If you write the resulting RDD out to a file, that RDD will not be fully materialized in Spark memory, but will be streamed out to the file. If that file is in Alluxio, it will be in Alluxio memory, not Spark memory. The resulting RDD will only be in Spark memory if you explicitly call cache().
Upvotes: 2