user3078523

Reputation: 1710

PySpark - multiple union on DataFrame getting slower

I encountered a problem with DataFrame union in PySpark (version 2.4.3): when unioning multiple DataFrames in a loop, each subsequent union gets slower.

A similar issue was already reported and marked as resolved in Spark 1.4: SPARK-12691.

Here is sample code:

from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import *
from time import perf_counter

sc = SparkContext()
spark = SparkSession(sc)

# small 10-row DataFrame that will be unioned repeatedly
df = sc.parallelize([(x, 2*x) for x in range(10)]).toDF()
# empty accumulator DataFrame with the same schema
df_all = spark.createDataFrame(sc.emptyRDD(), df.schema)
for i in range(2000):
    t1 = perf_counter()
    df_all = df_all.union(df)
    print(f"{i} - {round(perf_counter() - t1, 3)}s")

Output:

0 - 0.036s                                                                      
1 - 0.007s
2 - 0.009s
3 - 0.01s
4 - 0.009s
5 - 0.014s
6 - 0.01s
7 - 0.013s
8 - 0.015s
9 - 0.01s
...
1990 - 0.091s
1991 - 0.085s
1992 - 0.094s
1993 - 0.091s
1994 - 0.081s
1995 - 0.076s
1996 - 0.085s
1997 - 0.082s
1998 - 0.085s
1999 - 0.083s

If the issue is solved, why is the above code slowing down?

Upvotes: 1

Views: 3200

Answers (1)

gudok

Reputation: 4189

You are doing the union of df with itself every time, meaning that it doubles in size at every iteration. At the last iteration you get 10*(2^21) = 20971520 elements. So, the above timings are expected.
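
A quick back-of-the-envelope check of that growth (a minimal sketch; it assumes the earlier revision of the code self-unioned df_all for 21 iterations, which is not shown in the question):

# hypothetical: models the earlier, self-unioning code revision referred to above
rows = 10
for _ in range(21):   # each df_all = df_all.union(df_all) doubles the row count
    rows *= 2
print(rows)           # 10 * 2**21 = 20971520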

EDIT: you have entirely changed the code. Now you are doing the union with the original df of 10 elements every time. At the last (1999th) iteration the size will be 2000*10 = 20000 elements. So, once again, the timings are expected.
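
A minimal way to confirm that figure, assuming the loop from the question has already run (calling count() evaluates the whole union chain, so it will not be instant):

# each of the 2000 unions appends the 10-row df, so 2000 * 10 rows are expected
print(df_all.count())  # 20000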

Upvotes: 1
