Reputation:
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
badLinesCount = badLinesRDD.count()
warningCount = warningsRDD.count()
In the code above, none of the transformations are evaluated until the second-to-last line runs and counts the objects in badLinesRDD. So when badLinesRDD.count() is run, it will compute the first four RDDs up to and including the union and return the result. But when warningsRDD.count() is run, it will only compute the RDDs defined in the first three lines and return a result. Is that correct?
Also, when an action is called and these RDD transformations are computed, where are the objects from the last transformation (the union) stored? Are they kept in memory on each of the DataNodes where the filter transformation ran in parallel?
Upvotes: 0
Views: 380
Reputation: 330203
Unless task output is persisted explicitly (cache or persist, for example) or implicitly (shuffle write), and there is enough free space, every action will execute the complete lineage.
So when you call warningsRDD.count(), it will load the file (sc.textFile("log.txt")) and filter it (inputRDD.filter(lambda x: "warning" in x)).
Also when these RDD transformations are computed when an action is called on them where are the objects from the last RDD transformation, which is union, stored?
Assuming data is not persisted, nowhere. All task outputs are discarded after the data is passed to the next stage or written to output. If data is persisted, it depends on the storage settings (disk, on-heap, off-heap, DFS).
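To make the recomputation behaviour concrete, here is a minimal pure-Python sketch. It is a toy model and not Spark itself: LazyDataset, read_log and LOG_LINES are hypothetical names invented for illustration, and there are no partitions, executors or cache eviction. It only counts how often the source is read when each action replays its lineage, with and without a persist step.

```python
# Toy model of lazy lineage (NOT Spark): each dataset stores a recipe, not data.
LOG_LINES = [
    "error: disk full",
    "warning: low memory",
    "info: started",
    "warning: retrying",
]

reads = 0  # how many times the "file" has been read


def read_log():
    """Hypothetical stand-in for sc.textFile("log.txt")."""
    global reads
    reads += 1
    return list(LOG_LINES)


class LazyDataset:
    def __init__(self, compute):
        self._compute = compute  # zero-argument function that produces the rows
        self._persist = False
        self._cached = None

    def filter(self, pred):
        # Transformation: only records how to derive the rows from the parent.
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def union(self, other):
        return LazyDataset(lambda: self._materialize() + other._materialize())

    def persist(self):
        # Ask for the rows to be kept after they are first computed.
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        rows = self._compute()
        if self._persist:
            self._cached = rows
        return rows  # without persist, nothing is kept after the caller is done

    def count(self):
        # Action: replays the lineage and returns only the number of rows.
        return len(self._materialize())


# Same shape as the pipeline in the question, without persistence.
input_ds = LazyDataset(read_log)
errors = input_ds.filter(lambda x: "error" in x)
warnings = input_ds.filter(lambda x: "warning" in x)
bad_lines = errors.union(warnings)

bad_count = bad_lines.count()      # replays both branches of the union
reads_after_bad = reads
warning_count = warnings.count()   # replays load + filter again
reads_after_warning = reads

# Same pipeline, but the input is persisted before any action runs.
reads = 0
cached_input = LazyDataset(read_log).persist()
cached_warnings = cached_input.filter(lambda x: "warning" in x)
cached_bad = cached_input.filter(lambda x: "error" in x).union(cached_warnings)
cached_bad.count()
cached_warnings.count()
reads_with_persist = reads

print(bad_count, reads_after_bad, warning_count, reads_after_warning, reads_with_persist)
```

In the unpersisted run the source is read once per branch of the union and once more for the second count; in the persisted run it is read a single time. In real Spark the corresponding step would be calling cache() or persist() on inputRDD or warningsRDD before the first action, with the caveat from the answer that persisted data is only reused while there is enough space to keep it.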
Upvotes: 2