toofrellik

Reputation: 1307

How to use cache table for further queries in spark scala

I am reading around 500 GB of data from HDFS, performing aggregations, and creating an agg_master_table DataFrame as the output of a sqlContext.sql("....") query.

I need to use this agg_master_table for further queries, so I created a temp view using:

agg_master_table.createOrReplaceTempView("AggMasterTable")

But when I run further queries on top of AggMasterTable, it reads the data from HDFS again. I don't want this to happen, so I am using:

sqlContext.sql("CACHE TABLE AggMasterTableCache").collect()

so that the data is stored in memory and further queries return quickly. But now I am not able to do AggMasterTableCache.show() or use it in sqlContext.sql("Select * from AggMasterTableCache").

How do we make use of the cached table here?

Upvotes: 1

Views: 1419

Answers (2)

Goldie

Reputation: 164

Once you create a temporary view in Spark, you can cache it using the following code. If you check the Spark UI, you can see that after the first read it is no longer reading from HDFS.

spark.catalog.cacheTable("AggMasterTable")

Note: You have cached the temporary view, not the DataFrame, so any transformation/action on top of the DataFrame would result in reading from the source again, I believe.
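Putting this together, a minimal sketch of the approach (assuming a SparkSession named spark and the agg_master_table DataFrame from the question):

```scala
// Register the aggregated DataFrame as a temporary view.
agg_master_table.createOrReplaceTempView("AggMasterTable")

// Cache the view; the first query against it materializes it in memory.
spark.catalog.cacheTable("AggMasterTable")

// First access reads from HDFS and populates the cache.
spark.sql("SELECT * FROM AggMasterTable").show()

// Subsequent queries against the same view name hit the in-memory cache.
spark.sql("SELECT count(*) FROM AggMasterTable").show()

// Free the memory when finished.
spark.catalog.uncacheTable("AggMasterTable")
```

The key point is that the name passed to cacheTable must be the view you registered, so later SQL queries must also use that same name to benefit from the cache.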

Upvotes: 0

Vapira

Reputation: 121

Adding agg_master_table.persist() before the first action should do the trick. On the first computation, the data will be read from HDFS and stored, so further reads of the agg_master_table DataFrame will use the stored data.
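As a sketch of this alternative (again assuming a SparkSession named spark and the agg_master_table DataFrame from the question):

```scala
import org.apache.spark.storage.StorageLevel

// Mark the DataFrame for persistence before triggering any action.
// MEMORY_AND_DISK spills to disk if the ~500 GB result exceeds memory.
agg_master_table.persist(StorageLevel.MEMORY_AND_DISK)

// Register the view for SQL access as in the question.
agg_master_table.createOrReplaceTempView("AggMasterTable")

// The first action reads from HDFS and populates the cache;
// later actions and SQL queries reuse the persisted partitions.
spark.sql("SELECT count(*) FROM AggMasterTable").show()

// Release the storage when the DataFrame is no longer needed.
agg_master_table.unpersist()
```

Unlike caching the view, persist() is attached to the DataFrame itself, so both DataFrame operations and SQL queries over the view reuse the stored data.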

Upvotes: 1
