uh_big_mike_boi

Reputation: 3470

toDF() not handling RDD

I have an RDD of Rows called RowRDD that I am simply trying to convert into a DataFrame. From examples I have seen in various places online, I should be calling RowRDD.toDF(), but I am getting this error:

value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]

Upvotes: 1

Views: 2844

Answers (1)

zero323

Reputation: 330193

It doesn't work because Row is not a Product type, and createDataFrame with a single RDD argument is defined only for RDD[A] where A <: Product.

If you want to use an RDD[Row], you have to provide a schema as the second argument. If you think about it, this should be obvious: Row is just a container of Any, and as such it doesn't provide enough information for schema inference.

Assuming this is the same RDD as defined in your previous question then schema is easy to generate:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.rdd.RDD

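// Placeholder: substitute your actual RDD[Row] here.
val rowRdd: RDD[Row] = ???

// One non-nullable string column per field, named _1, _2, ...
// This treats every field in every Row as a non-null String.
val schema = StructType(
  (1 to rowRdd.first.size).map(i => StructField(s"_$i", StringType, nullable = false))
)

// With an RDD[Row], the schema must be passed explicitly as the second argument.
val df = sqlContext.createDataFrame(rowRdd, schema)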
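
For comparison, here is a rough sketch of the other route implied by the Product constraint above: map each Row to a tuple (which is a Product) and let the implicits supply toDF. The helper name toDataFrame, the column names "first" and "second", and the assumption that every Row holds exactly two non-null String fields are all hypothetical and only there for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

// Hypothetical helper; assumes each Row has exactly two non-null String fields.
def toDataFrame(sqlContext: SQLContext, rowRdd: RDD[Row]): DataFrame = {
  // Brings the implicit conversions that provide toDF on RDDs of Product types into scope.
  import sqlContext.implicits._

  rowRdd
    .map(row => (row.getString(0), row.getString(1))) // RDD[(String, String)], a Product type
    .toDF("first", "second")                          // illustrative column names
}
```

This only makes sense when the number and types of the fields are known at compile time; when they are only known at runtime, the explicit schema approach is the one to use.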

Upvotes: 6
