Bryden C

Reputation: 113

Performance Between SQL and withColumn

Suppose I create the following dataframe:

import numpy as np
import pandas as pd

dt = pd.DataFrame(np.array([[1,5],[2,12],[4,17]]), columns=['a','b'])
df = spark.createDataFrame(dt)  # `spark` is an existing SparkSession

I want to create a third column, c, that is the sum of these two columns. I know of two ways to do so.

The withColumn() method in Spark:

df1 = df.withColumn('c', df.a + df.b)

Or using SQL:

df.createOrReplaceTempView('mydf')
df2 = spark.sql('select *, a + b as c from mydf')

While both yield the same results, which method is computationally faster?

Also, how does SQL compare to a Spark user-defined function (UDF)?

Upvotes: 1

Views: 972

Answers (1)

pault

Reputation: 43504

While both yield the same results, which method is computationally faster?

Look at the execution plans:

df1.explain()
#== Physical Plan ==
#*(1) Project [a#0L, b#1L, (a#0L + b#1L) AS c#4L]
#+- Scan ExistingRDD[a#0L,b#1L]

df2.explain()
#== Physical Plan ==
#*(1) Project [a#0L, b#1L, (a#0L + b#1L) AS c#8L]
#+- Scan ExistingRDD[a#0L,b#1L]

Apart from the auto-generated expression IDs (c#4L vs. c#8L), these plans are identical, so the two methods will be executed in exactly the same way.

Generally speaking, there is no computational advantage to using either withColumn() or Spark SQL over the other. Both are compiled by the Catalyst optimizer into the same logical and physical plans, so if the code is written equivalently, the underlying computations will be identical.

There may be some cases where it's easier to express something in Spark SQL, for example if you want to use a column value as a parameter to a Spark function.
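As an illustration of that case: the DataFrame API's substring() accepts only integer position and length arguments in many Spark versions, so using another column as the length means writing a SQL expression anyway. A minimal sketch (the sample data and column names are made up for illustration):

from pyspark.sql.functions import expr

sdf = spark.createDataFrame([('hello', 3), ('world', 2)], ['s', 'n'])
sdf.createOrReplaceTempView('t')

# In SQL, the column n can be passed directly as the substring length
spark.sql('SELECT s, substring(s, 1, n) AS prefix FROM t').show()

# With the DataFrame API, the same thing requires wrapping the
# SQL expression in expr(), since substring() expects int arguments
sdf.withColumn('prefix', expr('substring(s, 1, n)')).show()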

Also, how does sql compare to a spark user defined function?

Take a look at this post: Spark functions vs UDF performance?
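The short version: built-in Spark functions are optimized by Catalyst and run inside the JVM, while a Python UDF is an opaque function whose rows must be shipped to a Python worker. The difference shows up in the execution plan. A minimal sketch using the df from the question, where the UDF is a hypothetical reimplementation of the built-in addition:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# A Python UDF doing the same work as the built-in `+`
add_udf = udf(lambda a, b: a + b, LongType())

df.withColumn('c', add_udf(df.a, df.b)).explain()
# The plan now contains a BatchEvalPython step: each row is
# serialized to a Python worker and back, which is why built-in
# functions are usually much faster than Python UDFs.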

Upvotes: 2
