Reputation: 23
I have the following code:
    Dataset<Row> rows = sparkSession.sql("<select from Hive tables with multiple joins>");
    rows.write().saveAsTable("<another external Hive table, written immediately>");
1) In the above case, when saveAsTable() is invoked, will Spark load the whole dataset into memory?
1.1) If yes, how do we handle the scenario where this query returns a huge volume of data that cannot fit into memory?
2) If the server crashes while Spark is executing saveAsTable() to write data to the external Hive table, is there a possibility of partial data being written to the target Hive table?
2.1) If yes, how do we avoid incomplete/partial data being persisted into the target Hive table?
Upvotes: 1
Views: 1593
Reputation: 26
Yes, Spark will hold the data in memory, but it processes it with parallel tasks. When the data is written, though, driver memory is also used to stage it before the write, so try increasing the driver memory as well.
So there are a couple of options. If you have spare memory in the cluster, you can increase num-cores, num-executors, and executor-memory, along with driver-memory, based on the data size.
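For example, a minimal sketch of setting those resources when building the session (the app name and values are placeholders; executor settings take effect when the session is created, while driver memory normally has to be supplied to spark-submit, e.g. --driver-memory 8g, because the driver JVM is already running by then):

    import org.apache.spark.sql.SparkSession;

    public class ResourceTuningSketch {
        public static void main(String[] args) {
            // Illustrative values only; size them to your cluster and data volume.
            SparkSession spark = SparkSession.builder()
                    .appName("hive-join-and-write")
                    .config("spark.executor.instances", "10")
                    .config("spark.executor.cores", "4")
                    .config("spark.executor.memory", "8g")
                    .enableHiveSupport()
                    .getOrCreate();

            // ... run the multi-join query and write to the target table here ...

            spark.stop();
        }
    }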
If you cannot fit all of the data in memory, break it up and process it in a loop programmatically.
Let's say the source data is partitioned by date and you have 10 days to process. Try processing one day at a time into a staging DataFrame, then partition the final table by date and overwrite only that date's partition on each iteration of the loop.
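A rough sketch of that loop, assuming a hypothetical source_db.big_table partitioned by event_date and a target final_db.target_table that already exists and is partitioned by the same column (names and dates are placeholders):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class DailyPartitionLoad {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("daily-partition-load")
                    .enableHiveSupport()
                    .getOrCreate();

            // Overwrite only the partitions present in each write instead of the
            // whole table (requires Spark 2.3+).
            spark.conf().set("spark.sql.sources.partitionOverwriteMode", "dynamic");

            // Placeholder dates; in practice derive them from the source table.
            List<String> dates = Arrays.asList("2019-01-01", "2019-01-02", "2019-01-03");

            for (String d : dates) {
                // Run the multi-join query for one day only, so each pass handles
                // a bounded slice of the data. Table and column names are hypothetical.
                Dataset<Row> daily = spark.sql(
                        "SELECT * FROM source_db.big_table WHERE event_date = '" + d + "'");

                // insertInto matches columns by position, not by name, and writes
                // into the existing partitioned target table.
                daily.write()
                     .mode(SaveMode.Overwrite)
                     .insertInto("final_db.target_table");
            }

            spark.stop();
        }
    }

With dynamic partition overwrite, each iteration replaces only that day's partition, so a failed run can simply be re-executed for the affected dates.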
Upvotes: 1