Nath5

Reputation: 1655

Spark DataFrame Cache Large TempTable

I have a Spark application with a very large DataFrame. I am currently registering the DataFrame as a tempTable so I can perform several queries against it.

When I am using RDDs I use persist(StorageLevel.MEMORY_AND_DISK()); what is the equivalent for a tempTable?

Below are two possibilities. I don't think Option 2 will work, because cacheTable tries to cache the table in memory and mine is too big to fit.

    DataFrame standardLocationRecords = inputReader.readAsDataFrame(sc, sqlc);

    // Option 1 //
    standardLocationRecords.persist(StorageLevel.MEMORY_AND_DISK());
    standardLocationRecords.registerTempTable("standardlocationrecords");

    // Option 2 //
    standardLocationRecords.registerTempTable("standardlocationrecords");
    sqlc.cacheTable("standardlocationrecords");

How can I best cache my tempTable so I can perform several queries against it without having to keep reloading the data?

Thanks, Nathan

Upvotes: 2

Views: 2085

Answers (1)

radek1st

Reputation: 1647

I've just had a look at the Spark 1.6.1 source code, and Option 2 is actually what you want. Here's an excerpt from a comment on caching:

... Unlike RDD.cache(), the default storage level is set to be MEMORY_AND_DISK because recomputing the in-memory columnar representation of the underlying table is expensive.

  def cacheTable(tableName: String): Unit = {
    cacheManager.cacheQuery(table(tableName), Some(tableName))
  }

  private[sql] def cacheQuery(
      query: Queryable,
      tableName: Option[String] = None,
      storageLevel: StorageLevel = MEMORY_AND_DISK): Unit 

Reference:

https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L355

https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L76
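So in your case, Option 2 on its own should do what you need. A rough sketch of how it could look in Java, reusing the sqlc and standardLocationRecords names from your question (the country column is made up, just for the example query):

    // Register the DataFrame and cache it; cacheTable defaults to MEMORY_AND_DISK
    standardLocationRecords.registerTempTable("standardlocationrecords");
    sqlc.cacheTable("standardlocationrecords");

    // Subsequent queries hit the cached data instead of re-reading the input
    // ("country" is a placeholder column name, substitute one from your schema)
    DataFrame byCountry = sqlc.sql(
        "SELECT country, COUNT(*) AS cnt FROM standardlocationrecords GROUP BY country");
    byCountry.show();

    // Release the cache once you're done querying
    sqlc.uncacheTable("standardlocationrecords");

As far as I remember the cache is populated lazily, i.e. when the first action runs against the table, so the first query still pays the load cost and the following ones reuse the cache.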

Upvotes: 2
