alexz00

Reputation: 163

Spark structured streaming query very slow on Windows

I have a simple PySpark application that runs a structured streaming query reading from a custom data source. I ran two series of tests, one writing the stream to CSV and the other just printing to the console, on both a Windows 10 host and a Linux host, with the same JVM and Python versions and similar hardware.

The application takes 15 seconds to process all the data on Linux, but between 5 minutes (console) and 9 minutes (CSV) on Windows.

My feeling is that on Windows Spark is spending a lot of time in IO, in particular in writing the checkpoints of the query.
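One way to check this hunch would be to sum the per-phase timings that each micro-batch reports in its progress. The helper below is only a sketch: `summarize_durations` is a hypothetical name, and the sample numbers are made up for illustration rather than taken from my runs. It assumes the progress dictionaries carry a `durationMs` map with phases such as `addBatch` and `walCommit`, as the ones returned by `query.recentProgress` in PySpark do.

```python
from collections import defaultdict


def summarize_durations(progress_events):
    """Sum the per-phase durations (in ms) across micro-batch progress reports."""
    totals = defaultdict(int)
    for event in progress_events:
        # Each progress report is a dict; "durationMs" maps phase name -> milliseconds.
        for phase, millis in event.get("durationMs", {}).items():
            totals[phase] += millis
    return dict(totals)


# Made-up sample shaped like StreamingQuery.recentProgress entries (illustration only).
sample = [
    {"batchId": 0, "durationMs": {"addBatch": 4000, "walCommit": 900, "triggerExecution": 5200}},
    {"batchId": 1, "durationMs": {"addBatch": 3500, "walCommit": 1100, "triggerExecution": 4900}},
]
print(summarize_durations(sample))
```

With a running query this would be called as `summarize_durations(query.recentProgress)`. Comparing the totals from the Windows and Linux runs should show which phase accounts for the difference.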

Has anyone experienced the same issue? Do you have any suggestions on how to improve performance on Windows?

I'm using Spark 2.4.7.

Thanks, Alessandro

Upvotes: 1

Views: 593

Answers (1)

user2017

Reputation: 464

In my case, an aggregation on a 10-row DataFrame in Spark streaming on Windows was taking around 5 minutes. After reducing spark.sql.shuffle.partitions from the default (200) to 1, latency dropped from 5 minutes to 1-2 seconds.
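A minimal sketch of applying that setting when the session is created. The names `SHUFFLE_CONF` and `build_session` and the application name are hypothetical; only the `spark.sql.shuffle.partitions` key and the value 1 come from my test. A likely explanation for the effect, stated as an assumption, is that a streaming aggregation keeps one state store per shuffle partition, so 200 partitions mean many small checkpoint writes per micro-batch.

```python
# Only this key/value comes from the test described above; everything else is illustrative.
SHUFFLE_CONF = {"spark.sql.shuffle.partitions": "1"}


def build_session(app_name="streaming-test", conf=SHUFFLE_CONF):
    """Create (or reuse) a SparkSession with the given settings applied."""
    # Imported here so the settings above can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `spark = build_session()` before defining the streaming query applies the value. Note that for a stateful query restarted from an existing checkpoint, the number of shuffle partitions stays at the value recorded there, so a fresh checkpoint location may be needed to see the change. A value of 1 suits small test data; larger workloads will probably want something closer to the number of available cores.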

Upvotes: 1
