Brandon Lee

Reputation: 765

Spark function for loading a Parquet file into memory

I have an RDD loaded from a Parquet file using SparkSQL:

data_rdd = sqlContext.read.parquet(filename).rdd

I have noticed that the actual read from the file is only executed once some aggregation function triggers a Spark job.

I need to measure the computation time of the job without the time it takes to read the data from the file (i.e. as if the input RDD/DataFrame were already there, because it was created from SparkSQL).

Is there a function that triggers loading of the file into the executors' memory?

I have tried .cache(), but it seems it still triggers the read operation as part of its job.

Upvotes: 1

Views: 6640

Answers (1)

MaFF

Reputation: 10076

Spark is lazy and will only do the computations it needs. You can .cache() the RDD and then .count() all the lines:

data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()
data_rdd.count()

Any computations that follow will start from the cached state of data_rdd, since count() forced Spark to read the whole table.
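For the timing use case in the question, a minimal sketch could look like the following. It assumes sqlContext and filename exist as in the question, and uses a hypothetical sum over the first column as the "job" to be timed; substitute your own aggregation.

import time

# Load and materialize the data up front, so read time is excluded from the measurement
data_rdd = sqlContext.read.parquet(filename).rdd
data_rdd.cache()
data_rdd.count()  # action: forces the Parquet read; the RDD is now cached on the executors

# Measure only the computation, starting from the cached RDD
start = time.time()
result = data_rdd.map(lambda row: row[0]).reduce(lambda a, b: a + b)  # hypothetical aggregation, assumes a numeric first column
elapsed = time.time() - start
print("computation time: %.3f s" % elapsed)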

Upvotes: 2
