Reputation: 132
I'm performing a batch process using Spark with Scala. Each day I need to import a sales file into a Spark DataFrame and perform some transformations (the file always has the same schema; only the date and the sales values change). At the end of the week, I need to use all the daily transformations to perform weekly aggregations. Consequently, I need to persist the daily transformations so that Spark doesn't have to import all the data and redo every transformation at the end of the week. I would also like a solution that supports incremental updates (upserts). I went through some options like DataFrame.persist(StorageLevel.DISK_ONLY). Are there better options, such as Hive tables? What are your suggestions? What are the advantages of using Hive tables over DataFrame.persist? Many thanks in advance.
Upvotes: 0
Views: 813
Reputation: 2938
You can save the results of your daily transformations in Parquet (or ORC) format, partitioned by day. Then you can run your weekly process on this Parquet dataset with a query that filters only last week's data. Predicate pushdown and partition pruning work efficiently in Spark, so only the data selected by the filter is loaded for further processing.
import org.apache.spark.sql.SaveMode

dataframe
  .write
  .mode(SaveMode.Append)   // append each day's data instead of overwriting
  .partitionBy("day")      // assuming you have a day column in your DF
  .parquet(parquetFilePath)
The SaveMode.Append option allows you to incrementally add data to the Parquet dataset (as opposed to overwriting it with SaveMode.Overwrite).
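For the weekly job, a minimal sketch of the read side could look like the following (the path, the week boundaries, and the sales column name are assumptions based on your description). Because day is the partition column, the filter lets Spark prune partitions and scan only last week's files:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder().appName("weekly-aggregation").getOrCreate()

// Only the partitions matching the filter on the "day" column are read from disk
val lastWeek = spark.read
  .parquet(parquetFilePath)
  .filter(col("day").between("2019-07-01", "2019-07-07")) // hypothetical week boundaries

// Example weekly aggregation over a hypothetical "sales" column
val weeklyTotal = lastWeek.agg(sum(col("sales")).as("weekly_sales"))
weeklyTotal.show()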
Upvotes: 2