Reputation: 765
I have an RDD loaded from a Parquet file using SparkSQL:
data_rdd = sqlContext.read.parquet(filename).rdd
I have noticed that the actual read from the file is only executed once some aggregation function triggers a Spark job.
I need to measure the computation time of the job without the time it takes to read the data from the file (i.e. as if the input RDD/DataFrame were already in memory, since it was created via SparkSQL).
Is there any function that triggers loading of the file into executor memory?
I have tried .cache(), but it seems the read is still executed as part of the job that follows.
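For reference, a minimal sketch of what I am measuring right now (timing with Python's time module; filename stands in for my actual path):

import time

data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()  # only marks the RDD for caching; nothing is read yet

start = time.time()
data_rdd.count()  # first action: triggers the Parquet read AND the computation
print("elapsed: %.2f s" % (time.time() - start))  # still includes the read time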
Upvotes: 1
Views: 6640
Reputation: 10076
Spark is lazy and will only do the computations it needs.
You can .cache() it and then .count() all the rows:
data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()   # marks the RDD for caching; still lazy, nothing is read yet
data_rdd.count()   # action: forces the Parquet read and materializes the cache
Any computations that follow will start from the cached state of data_rdd, since the count() read the whole table.
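As a rough sketch of how this separates the two timings (the aggregation at the end is just an illustrative example):

import time

data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()
data_rdd.count()  # the Parquet read time is paid here, once

start = time.time()
data_rdd.map(lambda row: 1).sum()  # example job running entirely on cached data
print("computation took %.2f s" % (time.time() - start))  # excludes the file read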
Upvotes: 2