Reputation: 765
I have an RDD loaded from a Parquet file using SparkSQL:
data_rdd = sqlContext.read.parquet(filename).rdd
I have noticed that the actual read from the file is only executed once some aggregation function triggers a Spark job.
I need to measure the computation time of the job without the time it takes to read the data from the file (i.e. as if the input RDD/DataFrame were already in memory, since it was created via SparkSQL).
Is there any function that triggers loading of the file into executor memory?
I have tried .cache(), but it seems the read is still executed as part of the job that follows.
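For reference, a minimal sketch of what I am measuring right now (timing with Python's time module; filename stands in for my actual path):

import time

data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()  # only marks the RDD for caching; nothing is read yet

start = time.time()
data_rdd.count()  # first action: triggers the Parquet read AND the computation
print("elapsed: %.2f s" % (time.time() - start))  # still includes the read time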
Upvotes: 1
Views: 6640
Reputation: 10076
Spark is lazy and will only do the computations it needs.
You can .cache() it and then .count() all the rows:
data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()   # marks the RDD for caching; still lazy, nothing is read yet
data_rdd.count()   # action: forces the Parquet read and materializes the cache
Any computations that follow will start from the cached state of data_rdd, since the count() read the whole table.
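As a rough sketch of how this separates the two timings (the aggregation at the end is just an illustrative example):

import time

data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()
data_rdd.count()  # the Parquet read time is paid here, once

start = time.time()
data_rdd.map(lambda row: 1).sum()  # example job running entirely on cached data
print("computation took %.2f s" % (time.time() - start))  # excludes the file read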
Upvotes: 2