Matthew Buxbaum

Reputation: 123

How to cache a Spark data frame and reference it in another script

Is it possible to cache a data frame and then reference (query) it in another script? My goal is as follows:

  1. In script 1, create a data frame (df)
  2. Run script 1 and cache df
  3. In script 2, query data in df

Upvotes: 12

Views: 4084

Answers (2)

zero323

Reputation: 330093

It is not possible using standard Spark binaries. A Spark DataFrame is bound to the specific SQLContext that was used to create it and is not accessible outside it.

There are tools, like for example Apache Zeppelin or Databricks, which use a shared context injected into different sessions. This is how you can share temporary tables between different sessions and/or guest languages.

There are other platforms, including spark-jobserver and Apache Ignite, which provide alternative ways to share distributed data structures. You can also take a look at the Livy server.

See also: Share SparkContext between Java and R Apps under the same Master

Upvotes: 7

ThatDataGuy

Reputation: 2109

You could also persist the actual data to a file or database and load it up again. Spark provides methods to do this, so you don't need to collect the data to the driver.

Upvotes: 0
