secfree

Reputation: 4647

What's the difference between Dataset.col() and functions.col() in Spark?

Here's a statement from another answer: https://stackoverflow.com/a/45600938/4164722

Dataset.col returns resolved column while col returns unresolved column.

Can someone provide more details? When should I use Dataset.col(), and when functions.col()?

Thanks.

Upvotes: 6

Views: 10634

Answers (2)

NYCeyes

Reputation: 5659

EXPLANATION:

At times you may want to programmatically pre-create column expressions ahead of time, before the related DataFrame(s) actually exist. In that use case, col(expression) can be useful. Generically illustrated using pySpark syntax:

>>> cX = col('col0')  # Define an unresolved column.                                                                           
>>> cY = col('myCol') # Define another unresolved column.                                                  
>>> cX,cY             # Show that these are naked column names.                                                                                            
(Column<b'col0'>, Column<b'myCol'>)

These are called unresolved columns because they are not yet associated with any DataFrame, so Spark cannot know whether those column names actually exist anywhere. You can, however, apply them in a DataFrame context later on, after having prepared them:

>>> df = spark_sesn.createDataFrame([Row(col0=10, col1='Ten', col2=10.0),])                                
>>> df                                                                                                     
DataFrame[col0: bigint, col1: string, col2: double]

>>> df.select(cX).collect()                                                                                
[Row(col0=10)]                      # cX is successfully resolved.

>>> df.select(cY).collect()                                                                                
Traceback (most recent call last):  # Oh dear! cY, which represents
[ ... snip ... ]                    # 'myCol' is truly unresolved here.
                                    # BUT maybe later on it won't be, say,
                                    # after a join() or something else.

CONCLUSION:

col(expression) can help programmatically decouple the DEFINITION of a column specification from the APPLICATION of it against DataFrame(s) later on. Note that expr(aString), which also returns a column specification, provides a generalization of col('xyz'), where whole expressions can be DEFINED and later APPLIED:

>>> cZ = expr('col0 + 10')   # Creates a column specification / expression.
>>> cZ
Column<b'(col0 + 10)'>

>>> df.select(cZ).collect() # Applying that expression later on.
[Row((col0 + 10)=20)]

I hope this alternative use-case helps.

Upvotes: 4

user9137650

Reputation: 101

In the majority of contexts there is no practical difference. For example:

val df: Dataset[Row] = ???

df.select(df.col("foo"))
df.select(col("foo"))

are equivalent, same as:

df.where(df.col("foo") > 0)
df.where(col("foo") > 0)

The difference becomes important when provenance matters, for example in joins:

val df1: Dataset[Row] = ???
val df2: Dataset[Row] = ???

df1.join(df2, Seq("id")).select(df1.col("foo") =!= df2.col("foo"))

Because Dataset.col is resolved and bound to a specific DataFrame, it allows you to unambiguously select a column descending from a particular parent. That wouldn't be possible with col.

Upvotes: 10
