Reputation: 59
I am working on a big dataset that has around 6000 million records, and I have performed all calculations/operations on it successfully. At the end, when I try to store the data to a Databricks (DBFS) database using the command below, it takes a very long time (more than 25-30 hrs) and does not even complete. Can someone suggest a good approach for handling such huge data?
df_matches_ml_target.write.mode("overwrite").saveAsTable("Demand_Supply_Match_ML")
Let me know if you need more information on this.
Upvotes: 1
Views: 1842
Reputation: 1353
Checkpoints will help. View the execution plan.
According to the documentation:
Checkpointing can be used to truncate the logical plan of this DataFrame, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext.setCheckpointDir().
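As a rough illustration of what this suggests, here is a minimal sketch of checkpointing the DataFrame before writing it out. The checkpoint directory path is a placeholder, and the sketch assumes an existing SparkSession named spark and the asker's df_matches_ml_target DataFrame:

```python
# Minimal sketch: set a checkpoint directory, checkpoint the DataFrame,
# then write the checkpointed result. Path below is a placeholder.
spark.sparkContext.setCheckpointDir("dbfs:/tmp/checkpoints")

# checkpoint() materializes the DataFrame and truncates its lineage,
# so the long chain of earlier transformations is not carried into the write.
df_checkpointed = df_matches_ml_target.checkpoint()

df_checkpointed.write.mode("overwrite").saveAsTable("Demand_Supply_Match_ML")
```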
Hope this helps
Upvotes: 0
Reputation: 187
It sounds like up until this point, as Bi Rico pointed out above, you've been executing "lazy" operations on your data set. Here's a detailed summary of what lazy execution means.
Essentially, any transformations you apply to your data set (such as map, flatMap, filter, etc.) will not execute until an action is called. An action is anything that requires a materialized result; examples include writing to a file (saveAsTable), count(), take(), etc. A short sketch of this distinction follows below.
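To make the lazy/eager distinction concrete, here is a small sketch; the DataFrame, column name, and filter condition are made up purely for illustration:

```python
# Sketch assuming an existing DataFrame `df` with a numeric column "score"
# (names are illustrative only, not from the original question).
from pyspark.sql import functions as F

# Transformations: these only build up a logical plan; nothing runs yet.
filtered = df.filter(F.col("score") > 0.5)
doubled = filtered.withColumn("score_x2", F.col("score") * 2)

# Action: this triggers execution of the entire plan built above.
row_count = doubled.count()
print(row_count)
```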
Since you have 6000 million records of an unknown size, it sounds like your data set is rather large, and that is likely a huge factor in why it's taking so long to execute actions.
When using Spark with Big Data, the general recommendation is to work on a smaller subset of your data. This allows you to check the validity of your transformations and code, and get results in a reasonable amount of time. Then you can apply your work to the entire data set.
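One way to do this in PySpark (the sample fraction and seed here are arbitrary, and the sample table name is a placeholder) is to develop against a small random sample and only run the full write once the logic is validated:

```python
# Sketch: validate transformations against a small sample first.
# Fraction, seed, and table name are illustrative choices only.
sample_df = df_matches_ml_target.sample(fraction=0.001, seed=42)

# Running the write on the sample verifies the pipeline quickly.
sample_df.write.mode("overwrite").saveAsTable("Demand_Supply_Match_ML_sample")

# Once the results look correct, apply the same write to the full DataFrame.
```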
Edit on 21 Sep 2018: Recommendations for faster processing times
It's hard to say without more information, but here are some general tips.
Upvotes: 2