Nath5

Reputation: 1655

Spark DataFrame Cache Large TempTable

I have a Spark application with a very large DataFrame. I am currently registering the DataFrame as a tempTable so I can perform several queries against it.

When I am using RDDs I use persist(StorageLevel.MEMORY_AND_DISK()); what is the equivalent for a tempTable?

Below are two possibilities. I don't think Option 2 will work, because cacheTable tries to cache the table in memory and mine is too big to fit.

    DataFrame standardLocationRecords = inputReader.readAsDataFrame(sc, sqlc);

    // Option 1 //
    standardLocationRecords.persist(StorageLevel.MEMORY_AND_DISK());
    standardLocationRecords.registerTempTable("standardlocationrecords");

    // Option 2 //
    standardLocationRecords.registerTempTable("standardlocationrecords");
    sqlc.cacheTable("standardlocationrecords");

How can I best cache my tempTable so I can perform several queries against it without having to keep reloading the data?

Thanks, Nathan

Upvotes: 2

Views: 2085

Answers (1)

radek1st

Reputation: 1647

I've just had a look at the Spark 1.6.1 source code, and Option 2 is actually what you want. Here's an excerpt from a comment on caching:

... Unlike RDD.cache(), the default storage level is set to be MEMORY_AND_DISK because recomputing the in-memory columnar representation of the underlying table is expensive.

  def cacheTable(tableName: String): Unit = {
    cacheManager.cacheQuery(table(tableName), Some(tableName))
  }

  private[sql] def cacheQuery(
      query: Queryable,
      tableName: Option[String] = None,
      storageLevel: StorageLevel = MEMORY_AND_DISK): Unit 

Reference:

https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L355

https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L76
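So in your case, Option 2 on its own should do what you need. A rough sketch of how it could look in Java, reusing the sqlc and standardLocationRecords names from your question (the country column is made up, just for the example query):

    // Register the DataFrame and cache it; cacheTable defaults to MEMORY_AND_DISK
    standardLocationRecords.registerTempTable("standardlocationrecords");
    sqlc.cacheTable("standardlocationrecords");

    // Subsequent queries hit the cached data instead of re-reading the input
    // ("country" is a placeholder column name, substitute one from your schema)
    DataFrame byCountry = sqlc.sql(
        "SELECT country, COUNT(*) AS cnt FROM standardlocationrecords GROUP BY country");
    byCountry.show();

    // Release the cache once you're done querying
    sqlc.uncacheTable("standardlocationrecords");

As far as I remember the cache is populated lazily, i.e. when the first action runs against the table, so the first query still pays the load cost and the following ones reuse the cache.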

Upvotes: 2
