Ian

Reputation: 1304

Distinct values from DataFrame to Array

I'm trying to get the distinct values of a single column of a DataFrame (called df) into an Array whose element type matches the column's data type. This is what I've tried, but it doesn't work:

def distinctValues[T: ClassTag](column: String): Array[T] = {
  df.select(df(column)).distinct.map {
    case Row(s: T) => s
  }.collect
}

The method is inside an implicit class, so calling df.distinctValues("some_col") gives me:

scala.MatchError: [ABCD] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

Is there an elegant way to achieve what I want that is also type safe?

I'm on Spark 1.4.1.

Upvotes: 4

Views: 4647

Answers (2)

Paweł Jurczenko

Reputation: 4471

The problem is that you're pattern matching on the Row instead of reading the value with its getAs method:

import org.apache.spark.sql.DataFrame
import scala.reflect.ClassTag

implicit final class DataFrameOps(val df: DataFrame) {
  def distinctValues[T: ClassTag](column: String): Array[T] =
    df.select(column).distinct().map(_.getAs[T](column)).collect()
}

Usage:

val ageArray: Array[Int] = df.distinctValues("age")
or
val ageArray = df.distinctValues[Int]("age")
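As an aside, the MatchError in the question is consistent with calling the method without a type argument: T is then inferred as Nothing, and a ClassTag-checked typed pattern for Nothing matches no value, so the partial function throws. A minimal sketch without Spark (using a hypothetical firstAs helper) shows the same failure mode:

```scala
import scala.reflect.ClassTag

// firstAs is a toy stand-in for the question's distinctValues:
// a typed pattern `t: T` is checked at runtime via the ClassTag.
def firstAs[T: ClassTag](xs: Seq[Any]): T = xs.head match {
  case t: T => t
}

firstAs[String](Seq("ABCD"))  // explicit type argument: matches, returns "ABCD"
// firstAs(Seq("ABCD"))       // T inferred as Nothing -> MatchError at runtime
```

This is why the getAs version above is safer to call with an explicit type argument (or an expected type on the val), as in the usage examples.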

Upvotes: 5

Vitalii Kotliarenko

Reputation: 2967

Since 1.4.0, Spark has had the dropDuplicates method, which implements distinct over a sequence of columns (or over all columns, if none are specified):

//drop duplicates considering only the specified columns
val distinctDf = df.select($"column").dropDuplicates(Seq("column"))
//this works too, since df has a single column after the select
val distinctDf = df.select($"column").dropDuplicates()
//collect the values; the ClassTag bound is needed because DataFrame.map returns an RDD
import scala.reflect.ClassTag
def getValues[T: ClassTag](df: DataFrame, columnName: String): Array[T] = {
  df.map(_.getAs[T](columnName)).collect()
}

getValues[String](distinctDf, "column")

Upvotes: 1
