Simon

Reputation: 6462

Spark Scala: how to force Spark to recompute some results (not to use its cache)

I'm new to Spark and I'm surprised that some results are not recomputed even though I didn't cache them (at least I didn't want to), i.e. I have to restart sbt to see the updated value.

Here is the relevant snippet of code:

val df: DataFrame = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql://dbHost:5432/tests?user=simon&password=password",
    "dbtable" -> "events")
).load()

val cached = df.cache()

val tariffs = cached.map(row => row.getAs[Int](2))

If I print tariffs.toDF().mean() I get the correct average, but if I change my code to:

val tariffs = cached.map(row => 0)

I don't see the new average (0) until I restart sbt. How can I avoid this behaviour?

Upvotes: 1

Views: 875

Answers (1)

Sim

Reputation: 13528

I cannot see your entire code, so I cannot answer with certainty, but if the following code produces the same output, you should file a bug report at https://issues.apache.org/jira/browse/spark:

// Compute both means in the same session; mean() is defined on GroupedData,
// so aggregate over the whole DataFrame via groupBy() with no grouping columns.
println(cached.map(row => row.getInt(2)).toDF().groupBy().mean().collect()(0))
println(cached.map(row => 0).toDF().groupBy().mean().collect()(0))
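
Running both statements back to back rules out a stale cache as the cause: both pipelines read the very same cached rows, so any difference in the printed values must come from the lambdas, not from caching.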

If, however, they produce different output, then very likely there was a problem with your REPL session.
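
For example, if you are working in the spark-shell or an sbt console, re-evaluating the definition in the same session should pick up the new lambda without restarting sbt. A minimal sketch, assuming the cached DataFrame from the question and Spark 1.x, where DataFrame.map returns an RDD:

// Redefine tariffs with the new lambda; the cache on cached stores the source rows,
// not the result of the map, so the next action recomputes it.
val tariffs = cached.map(row => 0)
println(tariffs.mean())   // prints 0.0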

More generally, to remove the effects of caching, use

cached.unpersist()
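
A minimal sketch, assuming the cached DataFrame from the question:

// Drop the cached blocks; blocking = true waits until they are actually removed.
cached.unpersist(blocking = true)

// The next action recomputes from the JDBC source instead of serving cached rows.
cached.show()

Call cached.cache() again afterwards if you want the next action to repopulate the cache.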

Upvotes: 1
