Alon

Reputation: 11885

Correct Spark DataFrame type for arrays

Regarding the following code:

val types = df.schema.fields.map(_.dataType)

types(i) match {
     case StringType => println("String")
     case ArrayType => println("Array")
     case _ => println("Miscellaneous")
}

ArrayType produces an error:

Pattern type is incompatible with expected type, found: ArrayType.type, required: DataType

At first glance this looks weird, because both StringType (which doesn't produce an error) and ArrayType live in the same package (org.apache.spark.sql.types).

But after checking the hierarchy tree, I found that the closest common ancestor of both classes is AbstractDataType:

AbstractDataType -> DataType -> AtomicType -> StringType

vs

AbstractDataType -> ArrayType

types is an Array of DataType.

So I could probably just cast types to an Array of AbstractDataType, but I don't think that's the solution: if types is an Array of DataType, then the cells that describe array columns should already be representable there as DataType values.
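For what it's worth, a quick spark-shell check (the column names here are just illustrative) confirms that the entries of types for array columns really are ArrayType instances:

import org.apache.spark.sql.functions.{array, lit}

val check = spark.range(1).select(lit("x").as("s"), array(lit("a")).as("ar"))
check.schema.fields.map(_.dataType.getClass.getName)
// Array(org.apache.spark.sql.types.StringType$, org.apache.spark.sql.types.ArrayType)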

Upvotes: 0

Views: 795

Answers (2)

abiratsis

Reputation: 7316

First of all, both ArrayType and StringType have DataType as a base class. The hierarchy in each case is as follows:

StringType > AtomicType > DataType > AbstractDataType

ArrayType > DataType > AbstractDataType

I believe that in your example you confused the class with the companion object of ArrayType, which does extend AbstractDataType directly.
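To make the distinction concrete, here is a minimal sketch: the bare identifier ArrayType refers to the companion object, which mainly serves as a constructor (its apply defaults containsNull to true):

import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

val a1: DataType = ArrayType(StringType)        // companion apply: ArrayType(StringType, true)
val a2: DataType = ArrayType(StringType, false) // containsNull set explicitly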

That means we can pattern match on the derived types of DataType, as you already did, but the match should be based on the case class constructor:

import org.apache.spark.sql.types._
import spark.implicits._ // for toDF; assumes spark is the active SparkSession

val df = Seq(
  ("1", Seq("A", "D", "B")),
  ("2", Seq("A", "D", "B")),
  ("3", Seq("B", "C", "G")),
  ("5", Seq("B", "D", "L"))
).toDF("ID","ar")

df.schema.fields.map{
    _.dataType match {
         case StringType => println("String")
         case ArrayType(_, _) => println("Array")
         case _ => println("Miscellaneous")
    }
}

// String
// Array

You can see that ArrayType expects two arguments, since its definition is:

case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

while StringType takes none:

class StringType private() extends AtomicType
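Because the pattern binds the constructor parameters, you can also extract the element type and nullability of each array column; a small sketch building on the df above:

df.schema.fields.foreach { field =>
  field.dataType match {
    case ArrayType(elementType, containsNull) =>
      println(s"Array of $elementType (containsNull = $containsNull)")
    case other =>
      println(s"Non-array type: $other")
  }
}

// Non-array type: StringType
// Array of StringType (containsNull = true)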

Finally, if you only need the string representation of each type, you can use the typeName property of DataType:

df.schema.fields.map{_.dataType.typeName}

// res24: Array[String] = Array(string, array)
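Relatedly, DataType also exposes simpleString, which includes the element type of arrays (same df as above):

df.schema.fields.map(_.dataType.simpleString)

// Array(string, array<string>)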

References

Understanding Pattern Matching with Sub-classes: https://users.scala-lang.org/t/correct-syntax-in-pattern-matching/2187/4

Upvotes: 1

pasha701

Reputation: 7207

Such a construction also works, since the typed pattern _: ArrayType matches any instance of the ArrayType class, regardless of its constructor arguments:

types(i) match {
  case StringType => println("String")
  case _: ArrayType => println("Array")
  case _ => println("Miscellaneous")
}
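For completeness, a runnable version of this, assuming the imports and the df from the first answer:

val types = df.schema.fields.map(_.dataType)

types.foreach {
  case StringType => println("String")
  case _: ArrayType => println("Array")
  case _ => println("Miscellaneous")
}

// String
// Array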

Upvotes: 0
