Reputation: 1307
I am reading around 500 GB of data from HDFS, performing aggregations, and creating an agg_master_table DataFrame as the output of a sqlContext.sql("....") query.
I need to use this agg_master_table for further queries, so I created a temp view using:
agg_master_table.createOrReplaceTempView("AggMasterTable")
But when I run further queries on top of AggMasterTable, it reads the data from HDFS again. I don't want this to happen, so I am using:
sqlContext.sql("CACHE TABLE AggMasterTableCache").collect()
so that the data is stored in memory and further queries return quickly. But now I am not able to do
AggMasterTableCache.show()
or use it in sqlContext.sql("Select * from AggMasterTableCache")
How do I make use of the cached table here?
Upvotes: 1
Views: 1419
Reputation: 164
Once you create a temporary view in Spark, you can cache it using the following code. When you check the Spark UI, you can see that after the first read it is no longer reading from HDFS.
spark.catalog.cacheTable("AggMasterTable")
Note that you cache the view under the name it was registered with ("AggMasterTable"); calling cacheTable with a name that was never registered, such as "AggMasterTableCache", raises an error because no such view exists.
Note: You have cached the temporary view, not the DataFrame, so any transformation/action on top of the DataFrame itself would result in reading from the source again, I believe.
Upvotes: 0
Reputation: 121
Adding agg_master_table.persist()
before the first calculation should do the trick.
On the first calculation, the data will be read from HDFS and stored, so further reads of the agg_master_table DataFrame will use the stored data.
Upvotes: 1