ggeop

Reputation: 1375

Spark multiple CSV reads?

In my Spark application I read a directory with many CSVs only once, but in the DAG I see multiple CSV reads.

Spark UI screenshot: (DAG showing the CSV source being scanned multiple times)

Upvotes: 1

Views: 628

Answers (1)

Salim

Reputation: 2178

Spark will read the CSV files multiple times if the DataFrame built from them is not cached.


    val df1 = spark.read.csv("path")
    val df2_result = df1.filter(...).write.save(...)
    val df3_result = df1.map(...).groupBy(...).agg(...).write.save(...)

Here both df2_result and df3_result will cause df1 to be rebuilt from the CSV files, because each action re-executes the lineage back to the source. To avoid this, cache df1 as shown below: it will be built from the CSV files once, and the second action will reuse the cached data instead of reading the files again.


    val df1 = spark.read.csv("path")
    df1.cache()
    val df2_result = df1.filter(...).write.save(...)
    val df3_result = df1.map(...).groupBy(...).agg(...).write.save(...)

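For illustration, here is a minimal, self-contained sketch of the same idea. The input path, column name `status`, output paths, and filter condition are made up for the example; adjust them to your data.

    import org.apache.spark.sql.SparkSession

    object CacheCsvExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("csv-cache-example")
          .master("local[*]")
          .getOrCreate()

        // Hypothetical input directory; one DataFrame over many CSV files.
        val df1 = spark.read.option("header", "true").csv("/data/input/*.csv")

        // Cache before running multiple actions so the CSVs are scanned only once.
        df1.cache()

        // First action: computes df1 from the CSV files and populates the cache.
        df1.filter(df1("status") === "ok")
          .write.mode("overwrite").parquet("/data/out/ok")

        // Second action: reads from the cache (InMemoryTableScan in the DAG),
        // not from the CSV files again.
        df1.groupBy("status").count()
          .write.mode("overwrite").parquet("/data/out/counts")

        spark.stop()
      }
    }

If the data is too large to keep in memory, persist(StorageLevel.MEMORY_AND_DISK) can be used instead of cache(), and unpersist() releases the cached data once it is no longer needed.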
Upvotes: 1
