Reputation: 100
I have noticed two different notation styles when referencing columns (in this case in a select statement). Is there a functional difference between the two?
val df = spark.read.table("mytable").select('column1,'column2)
vs.
val df = spark.read.table("mytable").select($"column1",$"column2")
I haven't been able to find anything that really explains the difference or if there is a standard.
Upvotes: 1
Views: 479
Reputation: 714
Whether you use 'column1, 'column2 or $"column1", $"column2", the returned values are ColumnName(column1) and ColumnName(column2). ColumnName happens to be a subclass of Column, which is one of the types that select accepts. However, the two notations are implemented differently.
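For example, checking the runtime type shows the same class for both (a quick sketch; it assumes a SparkSession named spark already exists and that spark.implicits._ is imported, which is explained just below):

import org.apache.spark.sql.Column
import spark.implicits._   // needed for both notations, see below

val bySymbol: Column = 'column1     // Symbol literal, converted by symbolToColumn
val byDollar: Column = $"column1"   // string interpolation, handled by StringToColumn.$
println(bySymbol.getClass.getSimpleName)   // ColumnName
println(byDollar.getClass.getSimpleName)   // ColumnName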
In order to use these notations, the import spark.implicits._ has to be included in the application, where spark is a SparkSession object. The import ensures that the following implicits are available and in scope.
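A minimal setup could look like the following (a sketch; the local[*] master and the table name mytable are just placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._   // brings symbolToColumn and StringToColumn into scope

// Without the import, 'column1 cannot be converted to a Column and
// $"column1" does not compile at all.
val df = spark.read.table("mytable").select('column1, $"column2")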
From the Spark source code:
@Experimental
object implicits extends SQLImplicits with Serializable {
protected override def _sqlContext: SQLContext = SparkSession.this.sqlContext
}
spark.implicits extends SQLImplicits:
package org.apache.spark.sql
abstract class SQLImplicits extends LowPrioritySQLImplicits {
...
/**
* Converts $"col name" into a [[Column]].
*
* @since 2.0.0
*/
implicit class StringToColumn(val sc: StringContext) {
def $(args: Any*): ColumnName = {
new ColumnName(sc.s(args: _*))
}
}
...
/**
* An implicit conversion that turns a Scala `Symbol` into a [[Column]].
* @since 1.3.0
*/
implicit def symbolToColumn(s: Symbol): ColumnName = new ColumnName(s.name)
}
When using $"column1", the $ method in the implicit class StringToColumn is invoked, which converts the String into a ColumnName instance.
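In other words, the interpolated form is just sugar around building a ColumnName (a sketch; it assumes spark.implicits._ is in scope):

import spark.implicits._

// The compiler rewrites $"column1" to StringContext("column1").$(),
// and the implicit class StringToColumn supplies that $ method.
val viaDollar  = $"column1"
val equivalent = new org.apache.spark.sql.ColumnName("column1")   // wraps the same column name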
' is a Scala symbol literal. Writing a symbol by itself does not require import spark.implicits._; however, converting a Scala Symbol to a Column does need the import. When the symbol notation is used, the implicit method symbolToColumn is applied and returns a ColumnName instance. Note that 'column1 is the same as Symbol("column1") in Scala.
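The two steps can be seen separately (a sketch; spark is assumed to be a SparkSession):

// The symbol literal alone needs no Spark import
val sym: Symbol = 'column1
println(sym == Symbol("column1"))   // true

// Turning it into a Column requires spark.implicits._, which supplies symbolToColumn
import spark.implicits._
val col: org.apache.spark.sql.Column = sym   // implicit symbolToColumn applies here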
From org.apache.spark.sql.Column.scala, ColumnName is a subclass of Column, so the objects returned by $ and ' can be used in the DataFrame/Dataset select methods.
/**
* A convenient class used for constructing schema.
*
* @since 1.3.0
*/
@InterfaceStability.Stable
class ColumnName(name: String) extends Column(name) {
...
}
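Because of that inheritance, values produced by ', by $, or even by new ColumnName(...) directly can all be mixed in a single select (a sketch on a small throwaway DataFrame):

import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("column1", "column2")
df.select('column1, $"column2", new org.apache.spark.sql.ColumnName("column1")).show()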
Upvotes: 2