the_Kid26

Reputation: 59

Why is there a type mismatch when unioning a collection of Datasets?

I'm trying to get the union of an ArrayBuffer[Dataset[_]].

So I wrote the following code:

import org.apache.spark.sql.Dataset
import scala.collection.mutable.ArrayBuffer

var buffer: ArrayBuffer[Dataset[_]] = ArrayBuffer.empty[Dataset[_]]

var size: Long = 0
...

if (size < 1000) {
  buffer.append(df)
  size = size + df.count()
} else {
  val unionedDataset = buffer.reduce(_ union _)
}

I get the following error:

type mismatch;
[error]  found   : org.apache.spark.sql.Dataset[_$2(in value $anonfun)] where type _$2(in value $anonfun)
[error]  required: org.apache.spark.sql.Dataset[_$2(in variable buffer)]
[error]           val unionedDataset = buffer.reduce(_ union _)
[error]                                                      ^

Shouldn't the type of the second argument in the anonymous function be the same as the type of the element it refers to?

Upvotes: 0

Views: 591

Answers (3)

the_Kid26

Reputation: 59

I figured out that I can avoid this issue by doing the following:

val unionedDataset = buffer.reduce(_.toDF() union _.toDF())

Upvotes: 1

Raphael Roth

Reputation: 27373

You could use Any instead of _; this should also work:

var buffer: ArrayBuffer[Dataset[Any]] = ArrayBuffer.empty[Dataset[Any]]

var size: Long = 0
...

if (size < 1000) {
  buffer.append(df.asInstanceOf[Dataset[Any]])
  size = size + df.count()
} else {
  val unionedDataset = buffer.reduce(_ union _)
}

Upvotes: 0

Alexey Romanov

Reputation: 170815

ArrayBuffer[Dataset[_]] can contain e.g. a Dataset[String] and a Dataset[Int] at the same time, and union isn't defined for them.

If you had ArrayBuffer[Dataset[T]] forSome { type T }, you could write buffer.reduce(_ union _) but then buffer.append(df) won't work: df must have type Dataset[T] but you don't know what T is.
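
To make the point above concrete, here is a minimal sketch that does not need Spark. Box is a hypothetical stand-in for Dataset (invariant in T, with a union that only accepts a Box of the same element type), and toCommon is an assumed helper playing roughly the role that toDF() plays in the self-answer. Neither name comes from Spark; the sketch only illustrates the typing problem, not Spark's behaviour.

```scala
// Hypothetical stand-in for Dataset[T]: invariant in T, with a union that
// only accepts another Box of the same element type.
final case class Box[T](items: List[T]) {
  def union(other: Box[T]): Box[T] = Box(items ++ other.items)

  // Assumed helper (analogous in spirit to toDF()): convert to one common
  // element type so that every Box in a collection has the same static type.
  def toCommon: Box[Any] = Box[Any](items)
}

object UnionSketch {
  // Each element may have a different, unknown T.
  val mixed: List[Box[_]] = List[Box[_]](Box(List(1, 2)), Box(List("a")))

  // This is the shape of the failing line from the question; it does not
  // compile, because the two wildcard types are not known to be equal:
  // val bad = mixed.reduce(_ union _)

  // Once every element has the same element type, reduce type-checks.
  val merged: Box[Any] = mixed.map(_.toCommon).reduce(_ union _)
}
```

The same reasoning suggests why converting each element to a single concrete type first, as in the toDF() and Dataset[Any] answers, gets past the compiler: both operands of union then have one known element type instead of two unrelated wildcards.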

Upvotes: 0
