Reputation: 161
I am running a Spark Streaming application, but the final save to Hive is slow: about 15 KB of data takes around 50 seconds for the first streaming mini-batch (as seen in the Spark UI SQL tab), and the time keeps increasing with every subsequent mini-batch. The relevant stage from the SQL tab:
saveAsTable at NativeMethodAccessorImpl.java:0
org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:358)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:280)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:745)
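For reference, the write happens once per mini-batch, roughly like this (a minimal sketch of the assumed setup; the stream source, column names, table name, and save mode are all assumptions, not my actual code):

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = (SparkSession.builder
         .appName("hive-writer")       # name is illustrative
         .enableHiveSupport()
         .getOrCreate())
ssc = StreamingContext(spark.sparkContext, batchDuration=10)  # 10s mini-batches

stream = ssc.socketTextStream("localhost", 9999)  # placeholder source

def save_batch(time, rdd):
    # Convert each micro-batch to a DataFrame and append it to a Hive table.
    if not rdd.isEmpty():
        df = spark.createDataFrame(rdd.map(lambda line: (line,)), ["value"])
        df.write.mode("append").saveAsTable("events")  # the slow step seen in the SQL tab

stream.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()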
Upvotes: 2
Views: 4681
Reputation: 703
Writing the data out of the cluster involves a lot of data movement (shuffling) across the executors. Some suggestions I can offer are to tune your Spark configuration, for example:
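Something along these lines (the values here are illustrative and depend on your data volume; both keys are standard Spark configuration settings):

from pyspark import SparkConf

# Illustrative values; tune to your workload.
conf = (SparkConf()
        .set("spark.sql.shuffle.partitions", "10")   # fewer shuffle partitions for small batches
        .set("spark.default.parallelism", "10"))     # RDD-level parallelism to match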
Upvotes: 0
Reputation: 3110
When Spark shuffles a DataFrame, it creates 200 partitions by default, and with small data 200 partitions can degrade performance.
I would suggest reducing the number of partitions and seeing if that helps.
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
You can use the statement above to reduce the shuffle partitions to 10.
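If you are using the SparkSession entry point instead of SQLContext, the same key can be set like this (assuming the session is named spark, as is conventional):

spark.conf.set("spark.sql.shuffle.partitions", "10")  # same configuration key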
Regards,
Neeraj
Upvotes: 4