Reputation: 829
Suppose we have a sequence of Any
val seq = Seq(1,2,null)
seq: Seq[Any] = List(1, 2, null)
Now, if one filters out the null elements, one obtains a new sequence:
val cleanSeq = seq.filterNot(_ == null)
cleanSeq: Seq[Any] = List(1, 2)
Now I would like to obtain the same type I would get if I created the sequence without the nulls in the first place:
val seq2 = Seq(1,2)
seq2: Seq[Int] = List(1, 2)
Can I somehow obtain a Seq[Int] starting from cleanSeq?
UPDATE
The previous was only a dummy example; the elements can have types other than Int, including complex types such as Array[Map[String, Float]]. The only assumption I can make is that I have a sequence which might contain nulls, and that the non-null elements share a common super type other than Any. After removing the nulls, I want to find that common super type.
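To make the problem concrete, here is a minimal sketch (plain Scala, no Spark) showing that filtering out the nulls leaves the static type unchanged:

```scala
val seq: Seq[Any] = Seq(1, 2, null)
val cleanSeq: Seq[Any] = seq.filterNot(_ == null)
// Every remaining element is an Int at runtime, but the compiler
// still sees Seq[Any] -- the common super type is not recovered:
val allInts = cleanSeq.forall(_.isInstanceOf[Int])
```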
UPDATE
The use case: I want to create a Spark dataframe from columns which have a name and values. The values are stored in a Seq, and from the type of that Seq I want to derive the schema for the dataframe.
Definition of a column:
import reflect.runtime.universe._
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
case class InternalColumn[A: TypeTag](colName: String, col: Seq[A]) {
  private def getType: DataType = ScalaReflection.schemaFor[A].dataType
  def genStructField: StructField = StructField(colName, getType)
}
Creation of dataframe:
def createDF[T](data: Seq[T], sch: StructType): DataFrame = {
  val dataRow: Seq[Row] = data.map {
    case row: Row      => row
    case prod: Product => Row(prod.productIterator.toSeq: _*)
    case d             => Row(d)
  }
  spark.createDataFrame(sc.makeRDD(dataRow), sch)
}
Usage
def createFromColumns(data: Seq[InternalColumn[_]]): DataFrame = {
  def loop(schema: StructType, cols: Seq[InternalColumn[_]]): StructType = cols.toList match {
    case Nil    => schema
    case h :: t => loop(schema.add(h.genStructField), t)
  }
  val sch: StructType = loop(new StructType(), data)
  createDF(data.map(_.col).transpose.map(Row.fromSeq), sch)
}
val df = createFromColumns(List(InternalColumn("c1", List(1,2,3)), InternalColumn("c2", List("a", "b", "c"))))
scala> df.show()
+---+---+
| c1| c2|
+---+---+
| 1| a|
| 2| b|
| 3| c|
+---+---+
scala> df.printSchema
root
|-- c1: integer (nullable = true)
|-- c2: string (nullable = true)
So far, so good. But one might want to create a dataframe which has a column containing nulls. If the column containing nulls has a nullable type, e.g. StringType, then it still works. The problem appears when you have nulls in a column whose type is not nullable, e.g.:
scala> val df = createFromColumns(List(InternalColumn("c1", List(1,2,null)), InternalColumn("c2", List("a", "b", "c"))))
java.lang.UnsupportedOperationException: Schema for type Any is not supported
This is the reason I want to infer the type of the sequence after I have removed all the nulls.
Upvotes: 0
Views: 131
Reputation: 262474
You should try not to lose the proper type in the first place (Seq[Any] is a bad code smell, as is using null), but to get back the real type you can do runtime type checks:
seq.collect{ case x: Int => x }
This will be a Seq[Int] again, having thrown out everything that is not an Int.
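Applied to the sequence from the question, the pattern both filters and narrows the type:

```scala
val seq: Seq[Any] = Seq(1, 2, null)
// collect keeps only the elements matching the pattern, and the
// result type is driven by the pattern: Seq[Int] here.
val ints: Seq[Int] = seq.collect { case x: Int => x }
// ints == List(1, 2)
```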
If you didn't actually have other things in there, but just Int or null, consider using Option[Int] instead:
val seq: Seq[Option[Int]] = Seq(Some(1), Some(2), None)
// then you can do
seq.flatten // gives you Seq[Int]
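If the data arrives with nulls anyway, Option(_) converts them at the boundary (null becomes None), after which flatten applies; a small sketch:

```scala
val raw: Seq[String] = Seq("a", "b", null)
// Option(x) is None when x is null, Some(x) otherwise:
val opts: Seq[Option[String]] = raw.map(Option(_))
val clean: Seq[String] = opts.flatten
// clean == List("a", "b")
```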
After removing the nulls, I want to find the common super type.
You do have to know (at the time you write the program) what that common super type is supposed to be. Then you can write code that checks against that type.
If you don't know that you want a Seq[Int] out of this, what will you declare the type of the result to be? Keep in mind that these generic types (the Int in Seq[Int]) are erased at runtime and only exist for the benefit of static type-checking at compile time.
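Erasure can be demonstrated directly: the element type of a List is gone at runtime, so a type check against List[Int] can only check "is it a List" (the compiler emits an unchecked warning for exactly this reason). A minimal sketch:

```scala
val anyList: Any = List("a", "b")
// The pattern List[Int] is checked only as "List" at runtime,
// so this match succeeds even though the list holds Strings:
val looksLikeInts = anyList match {
  case _: List[Int] => true
  case _            => false
}
```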
Upvotes: 5