Alessio Rossotti

Reputation: 314

Spark createDataFrame from RDD of objects, column order

I'm trying to convert an RDD of my custom objects (a Java class) to a DataFrame. I simply used the method hiveContext.createDataFrame, specifying the class of the object. The problem is that the DataFrame is created with its columns in a strange order, and once I write the DF to Hive the values end up in the wrong columns. Here is my code:

var objectRDD = tableDF.map((r: Row) => new Attuatore(r(0), r(1)...))
[.. operations with the RDD ..]
val resultDF = hiveContext.createDataFrame(objectRDD, classOf[Attuatore])
resultDF.write.mode("append").saveAsTable(outputTable)

The only solution I have found so far for getting the fields in the right order is to convert the RDD[Attuatore] back to an RDD[Row] and then call createDataFrame() with an explicit schema, but since I have to do this for many classes, I would prefer the first approach, which gives much cleaner code.
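For reference, the explicit-schema workaround described above might look roughly like the sketch below; the field names and types on Attuatore (codice, valore) and their getters are assumptions for illustration, not the actual class:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical schema matching the desired Hive column order.
val schema = StructType(Seq(
  StructField("codice", StringType, nullable = true),
  StructField("valore", DoubleType, nullable = true)
))

// Convert each Attuatore back to a Row in that same order...
val rowRDD = objectRDD.map(a => Row(a.getCodice, a.getValore))

// ...and build the DataFrame with the explicit schema.
val orderedDF = hiveContext.createDataFrame(rowRDD, schema)
```

This guarantees the column order but, as noted, has to be repeated per class.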

Upvotes: 1

Views: 1522

Answers (1)

Peter Halverson

Reputation: 418

As the documentation for HiveContext.createDataFrame says:

Since there is no guaranteed ordering for fields in a Java Bean, SELECT * queries will return the columns in an undefined order.

So if you need to put fields in a defined order, you have to do it explicitly, e.g.

import org.apache.spark.sql.functions.col

val MY_COLUMNS = Seq("field1", "field2", ...)
val conformedDF = resultDF.select(MY_COLUMNS.map(col(_)): _*)
conformedDF.write...

Upvotes: 1
