Krzysztof Atłasik

Reputation: 22595

Do Spark DataFrames/Datasets share data when cached?

Let's say I do something like this:

def readDataset: Dataset[Row] = ???

val ds1 = readDataset.cache();

val ds2 = ds1.withColumn("new", lit(1)).cache();

Will ds2 and ds1 share all data in the columns except the "new" column added to ds2? If I cache both datasets, will the whole of ds1 and ds2 be stored in memory, or will the shared data be stored only once?

If the data is shared, when is this sharing broken (so that the same data ends up stored in two memory locations)?

I know that Datasets and RDDs are immutable, but I couldn't find a clear answer as to whether they share data or not.

Upvotes: 4

Views: 82

Answers (1)

ebonnal

Reputation: 1167

In short: the cached data will not be shared. Each call to cache() materializes and stores the full output rows of that DataFrame's plan independently, so ds2's cache holds its own copy of every column, including the ones it has in common with ds1.

Experimental proof to convince you, with a code snippet and the corresponding memory usage as shown in the Spark UI:

val df = spark.range(10000000).cache()
val df2 = df.withColumn("other", col("id")*3)
df2.count()

uses about 10MB of memory:


while

val df = spark.range(10000000).cache()
val df2 = df.withColumn("other", col("id")*3).cache()
df2.count()

uses about 30MB:

  • for df: 10MB
  • for df2: 10MB for the copied column and another 10MB for the new one:

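If you want to check this without opening the Spark UI, here is a sketch (assuming a local SparkSession; exact byte counts will vary by Spark version and configuration) that inspects the cached blocks programmatically via SparkContext.getRDDStorageInfo:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-sharing-demo")
  .getOrCreate()

val df  = spark.range(10000000).cache()
val df2 = df.withColumn("other", col("id") * 3).cache()
df2.count() // materializes both caches

// getRDDStorageInfo returns one entry per cached RDD; two separate
// entries show up here, each with its own memory footprint,
// confirming that df and df2 do not share cached data.
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes " +
    s"in ${info.numCachedPartitions} partitions")
}
```

Note that getRDDStorageInfo is marked as a developer API (and deprecated in newer releases), so treat it as a debugging aid rather than something to rely on in production code.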

Upvotes: 5
