vdep

Reputation: 3590

Convert RDD to Dataset in Java Spark

I have an RDD that I need to convert into a Dataset. I tried:

Dataset<Person> personDS =  sqlContext.createDataset(personRDD, Encoders.bean(Person.class));

The line above fails with this error:

cannot resolve method createDataset(org.apache.spark.api.java.JavaRDD<Main.Person>, org.apache.spark.sql.Encoder<T>)

However, I can get a Dataset if I convert to a DataFrame first. The code below works:

Dataset<Row> personDF = sqlContext.createDataFrame(personRDD, Person.class);
Dataset<Person> personDS = personDF.as(Encoders.bean(Person.class));

Upvotes: 10

Views: 23539

Answers (4)

Yauheni Leaniuk

Reputation: 454

StructType schema = new StructType()
                .add("Id", DataTypes.StringType)
                .add("Name", DataTypes.StringType)
                .add("Country", DataTypes.StringType);

Dataset<Row> dataSet = sqlContext.createDataFrame(yourJavaRDD, schema);

Be careful with the schema variable: it is not always easy to predict which data type each column needs, so it is sometimes simpler to use StringType for all columns.
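For illustration, here is a minimal sketch of the schema approach end to end. It assumes Spark 2.x or later with a SparkSession (the answer above uses sqlContext, which offers the same createDataFrame overload), and it assumes the RDD holds Row objects, since createDataFrame with a StructType expects a JavaRDD<Row>. The class name SchemaExample, the method buildPeople, and the sample values are hypothetical.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class SchemaExample {

    // Hypothetical helper: builds a small JavaRDD<Row> and applies an explicit schema.
    public static Dataset<Row> buildPeople(SparkSession spark) {
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Each Row must supply its values in the same order as the schema fields.
        JavaRDD<Row> rows = jsc.parallelize(Arrays.asList(
                RowFactory.create("1", "Ann", "BY"),
                RowFactory.create("2", "Bob", "IN")));

        StructType schema = new StructType()
                .add("Id", DataTypes.StringType)
                .add("Name", DataTypes.StringType)
                .add("Country", DataTypes.StringType);

        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("schema-example")
                .master("local[*]") // local mode, for trying the sketch out
                .getOrCreate();
        try {
            buildPeople(spark).show();
        } finally {
            spark.stop();
        }
    }
}
```

If the RDD holds Person objects rather than Row objects, it would first have to be mapped to Row (for example with RowFactory.create inside a map call) before this overload applies.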

Upvotes: -1

MSS

Reputation: 3633

In addition to the accepted answer: if you want to create a Dataset<Row> instead of a Dataset<Person> in Java, try this:

StructType yourStruct = ...; // build your own StructType from the individual field types
Dataset<Row> personDS =  sqlContext.createDataset(personRDD.rdd(), RowEncoder.apply(yourStruct));

Upvotes: 1

vdep

Reputation: 3590

.createDataset() accepts an RDD<T>, not a JavaRDD<T>. JavaRDD is a wrapper around RDD that makes calls from Java code easier. It holds the underlying RDD internally, which can be accessed with .rdd(). The following creates a Dataset:

Dataset<Person> personDS =  sqlContext.createDataset(personRDD.rdd(), Encoders.bean(Person.class));
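As a fuller sketch of the same idea, the example below puts the call in context. It assumes Spark 2.x or later and uses SparkSession, which has the same createDataset(RDD<T>, Encoder<T>) overload as sqlContext. The Person bean shown here is hypothetical (the question does not show its fields); Encoders.bean expects a JavaBean-style class with a public no-argument constructor and getters and setters.

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class RddToDatasetExample {

    // Hypothetical bean: no-arg constructor plus getters/setters, as Encoders.bean expects.
    public static class Person implements Serializable {
        private String name;
        private int age;

        public Person() { }

        public Person(String name, int age) {
            this.name = name;
            this.age = age;
        }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Builds a small JavaRDD<Person>, unwraps it with .rdd(), and creates a typed Dataset.
    public static Dataset<Person> toDataset(SparkSession spark) {
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        JavaRDD<Person> personRDD = jsc.parallelize(Arrays.asList(
                new Person("Ann", 30),
                new Person("Bob", 25)));

        // createDataset takes the underlying Scala RDD, hence personRDD.rdd().
        return spark.createDataset(personRDD.rdd(), Encoders.bean(Person.class));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("rdd-to-dataset")
                .master("local[*]") // local mode, for trying the sketch out
                .getOrCreate();
        try {
            toDataset(spark).show();
        } finally {
            spark.stop();
        }
    }
}
```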

Upvotes: 18

Chitral Verma

Reputation: 2853

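Call .toDS() on your RDD and you will get a Dataset. Note that .toDS() comes from the Scala API (via import spark.implicits._); a JavaRDD does not expose it.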

Let me know if it helps. Cheers.

Upvotes: 1
