Reputation: 132
I'm performing a batch process using Spark with Scala. Each day I need to import a sales file into a Spark DataFrame and perform some transformations (the file always has the same schema; only the date and the sales values change). At the end of the week, I need to use all the daily transformations to perform weekly aggregations. Consequently, I need to persist the daily transformations so that Spark doesn't have to import all the data and redo every transformation at the end of the week. I would also like a solution that supports incremental updates (upserts). I went through some options like DataFrame.persist(StorageLevel.DISK_ONLY). Are there better options, such as Hive tables? What are your suggestions? What are the advantages of using Hive tables over DataFrame.persist? Many thanks in advance.
Upvotes: 0
Views: 813
Reputation: 2938
You can save the results of your daily transformations in Parquet (or ORC) format, partitioned by day. Then you can run your weekly process on this Parquet dataset with a query that filters only last week's data. Predicate pushdown and partition pruning work efficiently in Spark, so only the data selected by the filter is loaded for further processing.
import org.apache.spark.sql.SaveMode

dataframe
  .write
  .mode(SaveMode.Append)   // append each day's data instead of overwriting
  .partitionBy("day")      // assuming you have a day column in your DF
  .parquet(parquetFilePath)
The SaveMode.Append option allows you to incrementally add data to the Parquet dataset (as opposed to overwriting it with SaveMode.Overwrite).
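For the weekly job, a minimal sketch of the read side could look like the following (the path, the week boundaries, and the sales column name are assumptions based on your description). Because day is the partition column, the filter lets Spark prune partitions and scan only last week's files:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder().appName("weekly-aggregation").getOrCreate()

// Only the partitions matching the filter on the "day" column are read from disk
val lastWeek = spark.read
  .parquet(parquetFilePath)
  .filter(col("day").between("2019-07-01", "2019-07-07")) // hypothetical week boundaries

// Example weekly aggregation over a hypothetical "sales" column
val weeklyTotal = lastWeek.agg(sum(col("sales")).as("weekly_sales"))
weeklyTotal.show()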
Upvotes: 2