Reputation: 161
I am running a Spark Streaming application, but the final save to Hive is slow: about 15 KB of data takes around 50 seconds for the first streaming mini-batch (as seen in the Spark UI SQL tab), and the time keeps increasing with every subsequent mini-batch. The relevant stage from the SQL tab:
saveAsTable at NativeMethodAccessorImpl.java:0
org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:358)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:280)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:745)
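For reference, the write happens once per mini-batch, roughly like this (a minimal sketch of the assumed setup; the stream source, column names, table name, and save mode are all assumptions, not my actual code):

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = (SparkSession.builder
         .appName("hive-writer")       # name is illustrative
         .enableHiveSupport()
         .getOrCreate())
ssc = StreamingContext(spark.sparkContext, batchDuration=10)  # 10s mini-batches

stream = ssc.socketTextStream("localhost", 9999)  # placeholder source

def save_batch(time, rdd):
    # Convert each micro-batch to a DataFrame and append it to a Hive table.
    if not rdd.isEmpty():
        df = spark.createDataFrame(rdd.map(lambda line: (line,)), ["value"])
        df.write.mode("append").saveAsTable("events")  # the slow step seen in the SQL tab

stream.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()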
Upvotes: 2
Views: 4681
Reputation: 703
Writing the data out of the cluster involves a lot of data movement (shuffling) across the executors. Some suggestions I can offer are to tune your Spark configuration, for example:
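Something along these lines (the values here are illustrative and depend on your data volume; both keys are standard Spark configuration settings):

from pyspark import SparkConf

# Illustrative values; tune to your workload.
conf = (SparkConf()
        .set("spark.sql.shuffle.partitions", "10")   # fewer shuffle partitions for small batches
        .set("spark.default.parallelism", "10"))     # RDD-level parallelism to match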
Upvotes: 0
Reputation: 3110
When Spark shuffles a DataFrame, it creates 200 partitions by default, and with small data 200 partitions can degrade performance.
I would suggest reducing the number of partitions and seeing if that helps.
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
You can use the statement above to reduce the shuffle partitions to 10.
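If you are using the SparkSession entry point instead of SQLContext, the same key can be set like this (assuming the session is named spark, as is conventional):

spark.conf.set("spark.sql.shuffle.partitions", "10")  # same configuration key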
Regards,
Neeraj
Upvotes: 4