Reputation: 1304
I'm trying to get the distinct values of a single column of a DataFrame (called df) into an Array that matches the column's data type. This is what I've tried, but it does not work:
def distinctValues[T: ClassTag](column: String): Array[T] = {
  df.select(df(column)).distinct.map {
    case Row(s: T) => s
  }.collect
}
The method is inside an implicit class, so calling df.distinctValues("some_col")
gives me:
scala.MatchError: [ABCD] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Is there an elegant way to achieve what I want that is also type safe?
I'm on Spark 1.4.1.
Upvotes: 4
Views: 4647
Reputation: 4471
The problem is that you're using pattern matching instead of the getAs
method:
implicit final class DataFrameOps(val df: DataFrame) {
  def distinctValues[T: ClassTag](column: String): Array[T] = {
    df.select(column).distinct().map(_.getAs[T](column)).collect()
  }
}
Usage:
val ageArray: Array[Int] = df.distinctValues("age")
or
val ageArray = df.distinctValues[Int]("age")
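For a self-contained picture, here is how the implicit class might be wired up in a Spark 1.4 application. Note that the names sc, sqlContext, and the sample data are assumptions for illustration, not part of the original answer:

// assumes an existing SparkContext named `sc`
import org.apache.spark.sql.{DataFrame, SQLContext}
import scala.reflect.ClassTag

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

implicit final class DataFrameOps(val df: DataFrame) {
  def distinctValues[T: ClassTag](column: String): Array[T] =
    df.select(column).distinct().map(_.getAs[T](column)).collect()
}

// hypothetical sample data
val people = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 1))).toDF("name", "age")
val names: Array[String] = people.distinctValues("name")

Annotating the result type (or passing the type parameter explicitly) matters: without it, T would be inferred as Nothing and the getAs cast could not produce a usable array.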
Upvotes: 5
Reputation: 2967
Since 1.4.0, Spark has the dropDuplicates method, which implements distinct over a sequence of columns (or over all columns, if none is specified):
// drop duplicates considering the specified columns
val distinctDf = df.select($"column").dropDuplicates(Seq("column"))

// this works too, since the DataFrame has a single column after the select
val distinctDf = df.select($"column").dropDuplicates()

// collect the values
def getValues[T](df: DataFrame, columnName: String) = {
  df.map(_.getAs[T](columnName)).collect()
}

getValues[String](distinctDf, "column")
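One difference worth knowing: dropDuplicates on a subset of columns keeps entire rows (one arbitrary row per distinct key), whereas select(...).distinct keeps only the selected column. A short sketch, with hypothetical column names:

// keeps whole rows, one per distinct "name" value
val dedupedRows = df.dropDuplicates(Seq("name"))

// keeps only the "name" column itself
val distinctNames = df.select($"name").dropDuplicates()

So select the column first if you only want its values, as in the snippet above.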
Upvotes: 1