Converting typed JavaRDD to Row JavaRDD

Question

I am trying to convert a typed RDD to an RDD of Row and then create a DataFrame from it. The code throws an exception when I run it.

code:

JavaRDD<Counter> rdd = sc.parallelize(counters);
JavaRDD<Row> rowRDD = rdd.map((Function<Counter, Row>) RowFactory::create);

//I am using some schema here based on the class Counter
DataFrame df = sqlContext.createDataFrame(rowRDD, getSchema());
df.show(); //throws Exception

Does converting a typed RDD to a Row RDD preserve the field order in the row factory? If not, how do I make sure it does?

Class code :

class Counter {
  long vid;
  byte[] bytes;
  List<B> blist;
}
class B {
  String id;
  long count;
}

schema:

private StructType getSchema() {
  List<StructField> fields = new ArrayList<>();
  fields.add(DataTypes.createStructField("vid", DataTypes.LongType, false));
  fields.add(DataTypes.createStructField("bytes", DataTypes.createArrayType(DataTypes.ByteType), false));

  List<StructField> bFields = new ArrayList<>();
  bFields.add(DataTypes.createStructField("id", DataTypes.StringType, false));
  bFields.add(DataTypes.createStructField("count", DataTypes.LongType, false));
  StructType bclasSchema = DataTypes.createStructType(bFields);

  fields.add(DataTypes.createStructField("blist", DataTypes.createArrayType(bclasSchema, false), false));
  StructType schema = DataTypes.createStructType(fields);
  return schema;
}

fails with exception :

java.lang.ClassCastException: test.spark.SampleTest$A cannot be cast to java.lang.Long
  at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getLong(rows.scala:42)
  at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:221)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$LongConverter$.toScalaImpl(CatalystTypeConverters.scala:367)

zero323 · Accepted Answer

The problem is that no conversion takes place here. A Row accepts arbitrary Objects and stores them as is, so RowFactory.create called on a Counter yields a one-field row that holds the Counter itself. So it is not equivalent to DataFrame creation:

spark.createDataFrame(rdd, Counter.class);

or a Dataset creation:

Encoder<Counter> encoder = Encoders.bean(Counter.class);
spark.createDataset(rdd.rdd(), encoder); // createDataset takes an RDD or a List, not a JavaRDD

when working with bean classes.
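To see why the row ends up with a single field, here is a minimal plain-Java sketch that does not need Spark on the classpath. `createRow` is a hypothetical stand-in for `RowFactory.create(Object... values)`, and the `Counter` class is trimmed to two fields; both are assumptions made for illustration only.

```java
import java.util.function.Function;

class VarargsRowSketch {
    // Trimmed copy of the question's class; only the fields this demo needs.
    static class Counter {
        long vid;
        byte[] bytes;

        Counter(long vid, byte[] bytes) {
            this.vid = vid;
            this.bytes = bytes;
        }
    }

    // Hypothetical stand-in for RowFactory.create(Object... values):
    // it keeps the arguments as given, with no inspection or conversion.
    static Object[] createRow(Object... values) {
        return values;
    }

    // Same shape as rdd.map((Function<Counter, Row>) RowFactory::create):
    // the whole Counter is passed as the single varargs argument.
    static Object[] wrapWhole(Counter c) {
        Function<Counter, Object[]> f = VarargsRowSketch::createRow;
        return f.apply(c);
    }

    public static void main(String[] args) {
        Counter c = new Counter(42L, new byte[] {1, 2});
        Object[] row = wrapWhole(c);
        // The row has one field, and that field is the Counter object.
        System.out.println(row.length + " field(s), first is "
                + row[0].getClass().getSimpleName());
    }
}
```

A schema whose first field is LongType then tries to read that single object as a long, which is consistent with the ClassCastException in the question.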

So RowFactory::create is simply not applicable here. If you pass an RDD of Row, every value must already be in a form that maps directly onto the type declared for it in the schema. That means you have to explicitly map each Counter to a Row of the following shape:

Row(vid, bytes, List(Row(id1, count1), ..., Row(idN, countN)))

and your code should be equivalent to:

JavaRDD<Row> rows = rdd.map((Function<Counter, Row>) cnt -> {
  // Values are positional: pass them in the same order as the schema fields.
  return RowFactory.create(
    cnt.vid, cnt.bytes,
    cnt.blist.stream().map(b -> RowFactory.create(b.id, b.count)).toArray()
  );
});

Dataset<Row> df = sqlContext.createDataFrame(rows, getSchema());
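On the ordering question: row values are matched to schema fields by position, so the argument order has to follow getSchema() (vid, bytes, blist, and id, count for the nested rows). The following plain-Java sketch shows the resulting shape without Spark; `createRow` is again a hypothetical stand-in for `RowFactory.create`, and the constructors are added here purely for the demo.

```java
import java.util.Arrays;
import java.util.List;

class RowShapeSketch {
    static class B {
        String id;
        long count;

        B(String id, long count) {
            this.id = id;
            this.count = count;
        }
    }

    static class Counter {
        long vid;
        byte[] bytes;
        List<B> blist;

        Counter(long vid, byte[] bytes, List<B> blist) {
            this.vid = vid;
            this.bytes = bytes;
            this.blist = blist;
        }
    }

    // Hypothetical stand-in for RowFactory.create(Object... values).
    static Object[] createRow(Object... values) {
        return values;
    }

    // Flatten a Counter field by field, in schema order.
    static Object[] toRow(Counter cnt) {
        // Each B becomes its own nested row: (id, count).
        Object[] nested = cnt.blist.stream()
                .map(b -> createRow(b.id, b.count))
                .toArray();
        return createRow(cnt.vid, cnt.bytes, nested);
    }

    public static void main(String[] args) {
        Counter c = new Counter(7L, new byte[] {1},
                Arrays.asList(new B("a", 3L), new B("b", 5L)));
        Object[] row = toRow(c);
        Object[] nested = (Object[]) row[2];
        System.out.println(row.length + " top-level fields, "
                + nested.length + " nested rows");
    }
}
```

Swapping two arguments in `toRow` would put a value under the wrong schema field, so keeping the mapping next to the schema definition makes the correspondence easier to check.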
