Reputation: 22595
Let's say I do something like this:
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.lit

def readDataset: Dataset[Row] = ???
val ds1 = readDataset.cache()
val ds2 = ds1.withColumn("new", lit(1)).cache()
Will ds2 and ds1 share the data in all columns except the "new" column added to ds2? If I cache both datasets, will the whole of both ds1 and ds2 be stored in memory, or will the shared data be stored only once?
If data is shared, when does this sharing break (so that the same data ends up stored in two memory locations)?
I know that Datasets and RDDs are immutable, but I couldn't find a clear answer on whether they share data or not.
Upvotes: 4
Views: 82
Reputation: 1167
In short: the cached data will not be shared.
Here is some experimental proof to convince you, with code snippets and the corresponding memory usage as reported in the Spark UI:
import org.apache.spark.sql.functions.col

val df = spark.range(10000000).cache()
val df2 = df.withColumn("other", col("id") * 3) // df2 is not cached here
df2.count() // materializes only the cache of df
uses about 10 MB of memory (a single ~10 MB entry in the Spark UI's Storage tab), while
val df = spark.range(10000000).cache()
val df2 = df.withColumn("other", col("id") * 3).cache() // now df2 is cached too
df2.count() // materializes both caches
uses about 30 MB:

df: 10 MB
df2: 20 MB (10 MB for the copied "id" column and another 10 MB for the new one)

Upvotes: 5
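If you want to check the same numbers without opening the UI, they can also be read programmatically. Below is a minimal sketch, assuming a spark-shell session (so spark is already in scope); note that SparkContext.getRDDStorageInfo is marked as a DeveloperApi, so treat its output as informational:

import org.apache.spark.sql.functions.col

val df = spark.range(10000000).cache()
val df2 = df.withColumn("other", col("id") * 3).cache()
df2.count() // materializes both caches

// Each cached plan shows up as its own entry with its own memory footprint,
// which is why the shared "id" column is counted twice.
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(f"${info.name}: ${info.memSize / 1024.0 / 1024.0}%.1f MB in memory")
}

You can also call df.storageLevel (available since Spark 2.1) to confirm whether a given Dataset is actually marked for caching.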