Reputation: 163
I'm experiencing the following issue: I have a simple (Py)Spark application running a structured streaming query that reads data from a custom data source. I ran two series of tests, one writing the stream to CSV and the other just printing to the console, on a Windows 10 host and on a Linux host with the same JVM and Python versions and a similar hardware configuration.
The outcome is that the application takes 15 seconds to process all the data on Linux, but 5 minutes (console) to 9 minutes (CSV) on Windows.
My feeling is that on Windows Spark spends a lot of time in IO, in particular writing the checkpoints of the query.
Has anyone experienced the same issue? Do you have any suggestions on how to improve performance on Windows?
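For context, the query is wired up roughly like the sketch below (the custom source format name and the paths are placeholders, not the real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-test").getOrCreate()

# Read from the custom data source (format name is a placeholder)
df = spark.readStream.format("my.custom.source").load()

# Write the stream to CSV; the checkpoint directory is where the
# per-batch offset/commit files go, which is the IO I suspect is slow
# on Windows.
query = (df.writeStream
           .format("csv")
           .option("path", "output/")
           .option("checkpointLocation", "checkpoint/")
           .start())

query.awaitTermination()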
I'm using Spark 2.4.7.
Thanks, Alessandro
Upvotes: 1
Views: 593
Reputation: 464
Aggregation on a DataFrame (10 rows) in Spark Structured Streaming on Windows was taking around 5 minutes. After reducing spark.sql.shuffle.partitions from the default (200) to 1, latency improved from about 5 minutes to 1-2 seconds.
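A minimal sketch of where that setting goes, assuming you configure it on the session before starting the query (the app name is just an example):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("streaming-test")
         # Default is 200 shuffle partitions; for tiny micro-batches each
         # partition adds per-task scheduling and state/checkpoint overhead,
         # which is especially costly with Windows file IO.
         .config("spark.sql.shuffle.partitions", "1")
         .getOrCreate())

# The same setting can also be changed at runtime:
# spark.conf.set("spark.sql.shuffle.partitions", "1")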
Upvotes: 1