Reputation: 1655
I have a Spark application with a very large DataFrame. I am currently registering the DataFrame as a tempTable so I can perform several queries against it.
When I am using RDDs I use persist(StorageLevel.MEMORY_AND_DISK()); what is the equivalent for a tempTable?
Below are two possibilities. I don't think option 2 will work, because cacheTable tries to cache in memory and my table is too big to fit in memory.
DataFrame standardLocationRecords = inputReader.readAsDataFrame(sc, sqlc);
// Option 1 //
standardLocationRecords.persist(StorageLevel.MEMORY_AND_DISK());
standardLocationRecords.registerTempTable("standardlocationrecords");
// Option 2 //
standardLocationRecords.registerTempTable("standardlocationrecords");
sqlc.cacheTable("standardlocationrecords");
How can I best cache my tempTable so I can perform several queries against it without having to keep reloading the data?
Thanks, Nathan
Upvotes: 2
Views: 2085
Reputation: 1647
I've just had a look at the Spark 1.6.1 source code, and Option 2 is actually what you want. Here's an excerpt from a comment on caching:
... Unlike RDD.cache(), the default storage level is set to be MEMORY_AND_DISK because recomputing the in-memory columnar representation of the underlying table is expensive.
def cacheTable(tableName: String): Unit = {
  cacheManager.cacheQuery(table(tableName), Some(tableName))
}
private[sql] def cacheQuery(
    query: Queryable,
    tableName: Option[String] = None,
    storageLevel: StorageLevel = MEMORY_AND_DISK): Unit
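For completeness, here is a minimal sketch of Option 2 with the Spark 1.6 Java API. The parquet path and loading step are hypothetical stand-ins for your inputReader.readAsDataFrame(sc, sqlc); cacheTable, isCached, and uncacheTable are standard SQLContext methods:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SparkConf conf = new SparkConf().setAppName("cache-table-example");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlc = new SQLContext(sc);

// Hypothetical loader standing in for inputReader.readAsDataFrame(sc, sqlc)
DataFrame standardLocationRecords = sqlc.read().parquet("/path/to/records");

// Register first, then cache through the SQLContext. cacheTable defaults to
// MEMORY_AND_DISK, so partitions that don't fit in memory spill to disk
// instead of being dropped and recomputed.
standardLocationRecords.registerTempTable("standardlocationrecords");
sqlc.cacheTable("standardlocationrecords");

// Caching is lazy: the table is materialized on the first action, and
// subsequent queries read from the cached columnar representation.
sqlc.sql("SELECT COUNT(*) FROM standardlocationrecords").show();
System.out.println(sqlc.isCached("standardlocationrecords")); // true

// Free the memory and disk space once you've finished your queries
sqlc.uncacheTable("standardlocationrecords");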
Reference: the Spark 1.6.1 source code quoted above (SQLContext.cacheTable and CacheManager.cacheQuery).
Upvotes: 2