Reputation: 4647
This answer (https://stackoverflow.com/a/45600938/4164722) states:
"Dataset.col returns resolved column while col returns unresolved column."
Can someone provide more details? When should I use Dataset.col(), and when functions.col?
Thanks.
Upvotes: 6
Views: 10634
Reputation: 5659
EXPLANATION:
At times you may want to programmatically pre-create (i.e. ahead of time) column expressions for later use -- before the related DataFrame(s) actually exist. In that use case, col(expression) can be useful. Generically illustrated using pySpark syntax:
>>> cX = col('col0') # Define an unresolved column.
>>> cY = col('myCol') # Define another unresolved column.
>>> cX,cY # Show that these are naked column names.
(Column<b'col0'>, Column<b'myCol'>)
These are called unresolved columns because they are not associated with any DataFrame, so Spark cannot yet know whether the column names actually exist. You may, however, apply them in a DataFrame context later on, after having prepared them:
>>> df = spark_sesn.createDataFrame([Row(col0=10, col1='Ten', col2=10.0),])
>>> df
DataFrame[col0: bigint, col1: string, col2: double]
>>> df.select(cX).collect()
[Row(col0=10)] # cX is successfully resolved.
>>> df.select(cY).collect()
Traceback (most recent call last): # Oh dear! cY, which represents
[ ... snip ... ] # 'myCol' is truly unresolved here.
# BUT maybe later on it won't be, say,
# after a join() or something else.
CONCLUSION:
col(expression) can help programmatically decouple the DEFINITION of a column specification from the APPLICATION of it against DataFrame(s) later on. Note that expr(aString), which also returns a column specification, provides a generalization of col('xyz'), where whole expressions can be DEFINED and later APPLIED:
>>> cZ = expr('col0 + 10') # Creates a column specification / expression.
>>> cZ
Column<b'(col0 + 10)'>
>>> df.select(cZ).collect() # Applying that expression later on.
[Row((col0 + 10)=20)]
I hope this alternative use-case helps.
Upvotes: 4
Reputation: 101
In the majority of contexts there is no practical difference. For example:
import org.apache.spark.sql.functions.col

val df: Dataset[Row] = ???
df.select(df.col("foo"))
df.select(col("foo"))
are equivalent, same as:
df.where(df.col("foo") > 0)
df.where(col("foo") > 0)
The difference becomes important when provenance matters, for example in joins:
val df1: Dataset[Row] = ???
val df2: Dataset[Row] = ???
df1.join(df2, Seq("id")).select(df1.col("foo") =!= df2.col("foo"))
Because Dataset.col is resolved and bound to a specific DataFrame, it allows you to unambiguously select a column descending from a particular parent. That wouldn't be possible with col.
Upvotes: 10