user5041486

Spark RDD Lineage and Storage

# sc is the SparkContext (created automatically in the pyspark shell)
inputRDD = sc.textFile("log.txt")                        # transformation: lazy read
errorsRDD = inputRDD.filter(lambda x: "error" in x)      # transformation
warningsRDD = inputRDD.filter(lambda x: "warning" in x)  # transformation
badLinesRDD = errorsRDD.union(warningsRDD)               # transformation
badLinesCount = badLinesRDD.count()                      # action: triggers evaluation
warningCount = warningsRDD.count()                       # action: triggers evaluation again

In the code above, none of the transformations are evaluated until the second-to-last line, where the number of elements in badLinesRDD is counted. So when badLinesRDD.count() is run, it computes the first four RDDs, up to and including the union, and returns the result. But when warningsRDD.count() is run, it only computes the transformations in its own lineage, i.e. the file load and the warning filter, and returns a result. Is that correct?
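One way to check which lineage each action would walk is the RDD API's toDebugString(), which prints an RDD's dependency chain; a minimal sketch:

# Sketch: printing each RDD's lineage to see what an action on it would recompute
print(badLinesRDD.toDebugString())   # union <- both filters <- textFile
print(warningsRDD.toDebugString())   # only the warning filter <- textFile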

Also, when these RDD transformations are computed because an action is called on them, where are the objects from the last transformation, the union, stored? Are they stored in memory on each of the DataNodes where the filter transformations ran in parallel?

Upvotes: 0

Views: 380

Answers (1)

zero323

Reputation: 330203

Unless task output is persisted explicitly (for example with cache or persist) or implicitly (shuffle writes), and there is enough free space to keep it, every action will execute the complete lineage.
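For example, a minimal sketch (reusing the question's log.txt) where persisting inputRDD lets the second action skip re-reading the file:

# Sketch: persisting inputRDD so later actions reuse it instead of re-reading the file
inputRDD = sc.textFile("log.txt").cache()                # cache() == persist(MEMORY_ONLY)
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
badLinesCount = badLinesRDD.count()   # materializes inputRDD and caches its partitions
warningCount = warningsRDD.count()    # reuses the cached partitions of inputRDD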

So when you call warningsRDD.count() it will load the file (sc.textFile("log.txt")) and apply the filter (inputRDD.filter(lambda x: "warning" in x)).

Also, when these RDD transformations are computed because an action is called on them, where are the objects from the last transformation, the union, stored?

Assuming data is not persisted, nowhere. All task outputs are discarded once the data has been passed to the next stage or written as output. If data is persisted, where it lives depends on the storage level (local disk, on-heap memory, off-heap memory, DFS).
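As a sketch of those settings (the HDFS checkpoint path below is a placeholder):

from pyspark import StorageLevel

# Sketch: choosing where persisted partitions live
badLinesRDD.persist(StorageLevel.MEMORY_AND_DISK)    # on-heap memory, spilling to local disk
# badLinesRDD.persist(StorageLevel.OFF_HEAP)         # off-heap memory (needs off-heap config)
# badLinesRDD.persist(StorageLevel.DISK_ONLY)        # local disk only

# Checkpointing writes the RDD to a DFS and truncates its lineage
sc.setCheckpointDir("hdfs:///tmp/checkpoints")       # placeholder path
badLinesRDD.checkpoint()                             # materialized on the next action
badLinesRDD.count()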

Upvotes: 2
