Reputation: 1710
I encountered a problem with DataFrame union in PySpark (version 2.4.3): when unioning multiple DataFrames, each subsequent union gets slower.
A similar issue was already reported and marked as resolved in Spark 1.4: SPARK-12691.
Here is sample code:
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from time import perf_counter

sc = SparkContext()
spark = SparkSession(sc)

# A small 10-row DataFrame that will be unioned repeatedly.
df = sc.parallelize([(x, 2 * x) for x in range(10)]).toDF()
# An empty DataFrame with the same schema, used as the accumulator.
df_all = spark.createDataFrame(sc.emptyRDD(), df.schema)

for i in range(2000):
    t1 = perf_counter()
    df_all = df_all.union(df)
    print(f"{i} - {round(perf_counter() - t1, 3)}s")
Output:
0 - 0.036s
1 - 0.007s
2 - 0.009s
3 - 0.01s
4 - 0.009s
5 - 0.014s
6 - 0.01s
7 - 0.013s
8 - 0.015s
9 - 0.01s
...
1990 - 0.091s
1991 - 0.085s
1992 - 0.094s
1993 - 0.091s
1994 - 0.081s
1995 - 0.076s
1996 - 0.085s
1997 - 0.082s
1998 - 0.085s
1999 - 0.083s
If the issue has been resolved, why does the above code keep slowing down?
Upvotes: 1
Views: 3200
Reputation: 4189
You are unioning df with itself every time, meaning that it doubles in size at every iteration. At the last iteration you get 10 * 2^21 = 20971520 elements, so the timings above are expected.
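As a quick sanity check on that arithmetic (a plain-Python sketch, not Spark code), doubling a 10-row DataFrame 21 times gives:

# Sketch of the self-union growth described above: unioning
# df_all with itself doubles the row count each iteration.
rows = 10
for i in range(21):
    rows *= 2
print(rows)  # 10 * 2**21 == 20971520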
EDIT: you entirely changed the code. Now you are unioning with the original df of 10 elements every time. At the last, 1999th iteration the size will be 2000 * 10 = 20000 elements. So, once again, the timings are expected.
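If the goal is simply to end up with 2000 copies of those 10 rows (an assumption on my part), you could also build the data once instead of unioning in a loop; a minimal sketch:

# Minimal sketch, assuming the goal is a single DataFrame holding
# 2000 copies of the 10 sample rows: one createDataFrame call
# instead of 2000 union calls.
data = [(x, 2 * x) for x in range(10)] * 2000
df_all = spark.createDataFrame(data, df.schema)
print(df_all.count())  # 20000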
Upvotes: 1