Reputation: 224
In Spark with Scala, when we have to convert an RDD[Row] to a DataFrame, why do we have to first convert the RDD[Row] to an RDD of a case class or an RDD of tuples in order to use rdd.toDF()? Is there any specific reason this was not provided for RDD[Row]?
import org.apache.spark.sql.{Row, SparkSession}

object RDDParallelize {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[1]")
      .appName("learn")
      .getOrCreate()
    val abc = Row("val1", "val2")
    val abc2 = Row("val1", "val2")
    val rdd1 = spark.sparkContext.parallelize(Seq(abc, abc2))
    import spark.implicits._
    rdd1.toDF() // doesn't work (does not compile)
  }
}
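For reference, the conversion mentioned above, rewritten with a case class, does compile (a sketch; the Record name is just for illustration):

import org.apache.spark.sql.SparkSession

// Product types such as case classes get an implicit Encoder
// from spark.implicits._, so toDF compiles for RDD[Record].
case class Record(a: String, b: String)

object RDDParallelizeCaseClass {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[1]")
      .appName("learn")
      .getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(Record("val1", "val2"), Record("val1", "val2")))
    rdd.toDF().show() // works: columns "a" and "b" come from the case class fields
  }
}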
Upvotes: 0
Views: 456
Reputation: 584
It is confusing because there is an implicit conversion behind the toDF method. As you may have seen, toDF is not a method of the RDD class; it is defined on DatasetHolder, and you are relying on rddToDatasetHolder in SQLImplicits to convert the RDD you created into a DatasetHolder. If you look into the method rddToDatasetHolder,
implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(rdd))
}
you will see that it requires an Encoder for T, which is
Used to convert a JVM object of type T to and from the internal Spark SQL representation.
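To make that requirement concrete, here is a small sketch (the names are just for illustration): the implicit Encoder lookup succeeds for a tuple type but finds nothing for Row.

import org.apache.spark.sql.{Encoder, Row, SparkSession}

val spark = SparkSession.builder().master("local[1]").appName("encoders").getOrCreate()
import spark.implicits._

// Resolves: SQLImplicits supplies encoders for Product types such as tuples.
val tupleEnc: Encoder[(String, String)] = implicitly[Encoder[(String, String)]]

// Does not compile if uncommented: there is no implicit Encoder[Row].
// val rowEnc: Encoder[Row] = implicitly[Encoder[Row]]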
If you try to convert an RDD[Row] to a DatasetHolder, you need an encoder that tells Spark how to convert a Row object to the internal SQL representation. However,
Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases
Spark does not provide any encoder for the Row type, so this conversion can never succeed.
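The usual way around this is to skip toDF entirely and call spark.createDataFrame, which accepts an RDD[Row] together with an explicit StructType schema instead of relying on an encoder. A sketch (column names are just for illustration):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[1]").appName("learn").getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(Row("val1", "val2"), Row("val1", "val2")))

// The schema plays the role the missing Encoder[Row] would have played:
// it tells Spark how to map each Row into the internal representation.
val schema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true)
))

val df = spark.createDataFrame(rdd, schema)
df.show()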
Upvotes: 3