Reputation: 59
I am working on a big dataset that has around 6000 million records, and I have performed all calculations/operations on it successfully. At the end, when I try to store the data to a Databricks (DBFS) database using the command below, it takes a very long time (more than 25-30 hrs) and does not even complete. Can someone suggest a good approach for handling such huge data?
df_matches_ml_target.write.mode("overwrite").saveAsTable("Demand_Supply_Match_ML")
Let me know if you need more information on this.
Upvotes: 1
Views: 1842
Reputation: 1353
Checkpoints will help. View the execution plan.
According to the documentation:
Checkpointing can be used to truncate the logical plan of this DataFrame, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext.setCheckpointDir().
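As a rough illustration of what this suggests, here is a minimal sketch of checkpointing the DataFrame before writing it out. The checkpoint directory path is a placeholder, and the sketch assumes an existing SparkSession named spark and the asker's df_matches_ml_target DataFrame:

```python
# Minimal sketch: set a checkpoint directory, checkpoint the DataFrame,
# then write the checkpointed result. Path below is a placeholder.
spark.sparkContext.setCheckpointDir("dbfs:/tmp/checkpoints")

# checkpoint() materializes the DataFrame and truncates its lineage,
# so the long chain of earlier transformations is not carried into the write.
df_checkpointed = df_matches_ml_target.checkpoint()

df_checkpointed.write.mode("overwrite").saveAsTable("Demand_Supply_Match_ML")
```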
Hope this helps
Upvotes: 0
Reputation: 187
It sounds like up until this point, as Bi Rico pointed out above, you've been executing "lazy" operations on your data set. Here's a detailed summary of what lazy execution means.
Essentially, any transformations you apply to your data set (such as map, flatMap, filter, etc.) will not execute until an action is called. An action is anything that requires a materialized result; examples include writing to a file (saveAsTable), count(), take(), etc. A short sketch of this distinction follows below.
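To make the lazy/eager distinction concrete, here is a small sketch; the DataFrame, column name, and filter condition are made up purely for illustration:

```python
# Sketch assuming an existing DataFrame `df` with a numeric column "score"
# (names are illustrative only, not from the original question).
from pyspark.sql import functions as F

# Transformations: these only build up a logical plan; nothing runs yet.
filtered = df.filter(F.col("score") > 0.5)
doubled = filtered.withColumn("score_x2", F.col("score") * 2)

# Action: this triggers execution of the entire plan built above.
row_count = doubled.count()
print(row_count)
```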
Since you have 6000 million records of an unknown size, it sounds like your data set is rather large, and that is likely a huge factor in why it's taking so long to execute actions.
When using Spark with Big Data, the general recommendation is to work on a smaller subset of your data. This allows you to check the validity of your transformations and code, and get results in a reasonable amount of time. Then you can apply your work to the entire data set.
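One way to do this in PySpark (the sample fraction and seed here are arbitrary, and the sample table name is a placeholder) is to develop against a small random sample and only run the full write once the logic is validated:

```python
# Sketch: validate transformations against a small sample first.
# Fraction, seed, and table name are illustrative choices only.
sample_df = df_matches_ml_target.sample(fraction=0.001, seed=42)

# Running the write on the sample verifies the pipeline quickly.
sample_df.write.mode("overwrite").saveAsTable("Demand_Supply_Match_ML_sample")

# Once the results look correct, apply the same write to the full DataFrame.
```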
Edit on 21 Sep 2018: Recommendations for faster processing times
It's hard to say without more information, but here are some general tips.
Upvotes: 2