user1411837

Reputation: 1564

Spark joins: save as DataFrames or partitioned Hive tables?

I'm working on a project with test data of close to 1 million records, and there are 4 such files. The task is to perform around 40 calculations joining the data from the 4 different files, each close to 1 GB.

Currently, I save the data from each file into a Spark table using saveAsTable and perform the operations. For example, table1 joins with table2 and the result is saved to table3; table3 (the result of 1 and 2) joins with table4, and so on. Finally, I save these calculations to a separate table and generate the reports. A rough sketch of the flow is shown below.
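A minimal sketch of what I mean, assuming Hive-backed tables; the table names and the "id" join key are placeholders, not my real schema:

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport is needed for saveAsTable against Hive-managed tables
val spark = SparkSession.builder()
  .appName("monthly-calcs")   // placeholder app name
  .enableHiveSupport()
  .getOrCreate()

// table1 joins table2; the result is persisted as table3
spark.table("table1")
  .join(spark.table("table2"), Seq("id"))   // "id" is a placeholder join key
  .write.mode("overwrite").saveAsTable("table3")

// table3 (the result of 1 and 2) joins table4, and so on
spark.table("table3")
  .join(spark.table("table4"), Seq("id"))
  .write.mode("overwrite").saveAsTable("final_results")
```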

The entire process takes around 20 minutes, and my concern is whether there will be performance issues when this code reaches production with probably 5 times as much data.

Or is it better to save the data from each file in a partitioned way, then perform the joins and arrive at the final result set (see the sketch below)?
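Roughly what I have in mind for the partitioned route, reusing the session from the sketch above; the path and the "month" column are assumptions based on the monthly load cadence:

```scala
// Persist each source partitioned by a month column, so joins and small
// user-driven updates only touch the affected partitions.
spark.read.parquet("/data/file1")   // placeholder input path
  .write
  .mode("overwrite")
  .partitionBy("month")             // assumed partition column
  .saveAsTable("table1_partitioned")
```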

P.S. The objective is to get near-instant results, and there may be cases where the user updates a few rows in a file and expects an instant result. The data arrives on a monthly basis, basically once every month, with categories and sub-categories within.

Upvotes: 0

Views: 218

Answers (1)

Rony

Reputation: 206

What you are doing is fine, but make sure to cache + count after every resource-intensive operation instead of chaining all the joins and only saving at the last step.

If you do not cache in between, Spark will run the entire DAG from top to bottom at the last step; this can exhaust JVM memory and cause spills to disk during the operations, which in turn increases execution time.
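A minimal sketch of that pattern, with placeholder table names and join key; cache() keeps the intermediate result in memory, and count() is just a cheap action that forces it to materialize:

```scala
// Cache each intermediate join and materialize it with count() so Spark
// does not recompute the whole DAG at the final save.
val joined12 = spark.table("table1")
  .join(spark.table("table2"), Seq("id"))
  .cache()
joined12.count()   // action that materializes the cached join

val joined124 = joined12
  .join(spark.table("table4"), Seq("id"))
  .cache()
joined124.count()

joined124.write.mode("overwrite").saveAsTable("final_results")
```

Once a downstream result is materialized, you can call unpersist() on the earlier intermediate (e.g. joined12.unpersist()) to free executor memory.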

Upvotes: 1
