Reputation: 6462
I'm new to Spark and I'm surprised that some results are not recomputed even though I didn't (at least I didn't intend to) cache them, i.e. I have to restart sbt to see the updated value.
Here is the relevant snippet of code:
val df: DataFrame = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql://dbHost:5432/tests?user=simon&password=password",
    "dbtable" -> "events")
).load()
val cached = df.cache()
val tariffs = cached.map(row => row.getAs[Int](2))
If I print tariffs.toDF().mean()
I get the correct average, but if I change my code to:
val tariffs = cached.map(row => 0)
I don't see the new average (0)
until I restart sbt. How can I avoid this behaviour?
Upvotes: 1
Views: 875
Reputation: 13528
I cannot see your entire code, so I cannot answer with certainty, but if the following two lines produce the same output, you should file a bug report at https://issues.apache.org/jira/browse/spark
println(cached.map(row => row.getInt(2)).toDF().groupBy().mean().first())
println(cached.map(row => 0).toDF().groupBy().mean().first())
If, however, they produce different output, then very likely there was a problem with your REPL session.
More generally, to remove the effects of caching, use
cached.unpersist()
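For example, here is a minimal sketch of forcing a recomputation from the source within the same session; it reuses the sqlContext and the JDBC options from the question, and the groupBy().mean() call is just one way of computing the average, not necessarily what your real code does:
import sqlContext.implicits._   // needed for .toDF() on the mapped RDD

// Drop the cached copy so later actions recompute from the JDBC source.
cached.unpersist()

// Re-read the table from Postgres (same options as in the question).
val fresh = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql://dbHost:5432/tests?user=simon&password=password",
    "dbtable" -> "events")
).load()

// Any transformation defined now runs against the freshly read data.
println(fresh.map(row => row.getAs[Int](2)).toDF().groupBy().mean().first())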
Upvotes: 1