Reputation: 477
I have the following code to read data from a parquet to Dataframe
DataFrame addressDF = sqlContext.read().parquet(addressParquetPath);
How do I read data from parquet into a Dataset?
Dataset dataset = sqlContext.createDataset(sqlContext.read().parquet(propertyParquetPath).toJavaRDD(), Encoder.);
What should the Encoder parameter contain? Also, do I have to create a property class and pass that, or how does it work?
Upvotes: 3
Views: 3613
Reputation: 3226
The Encoder for a type T
is the class that tells Spark how instances of T
can be encoded to and decoded from Spark's internal representation. It contains the schema of the class and the Scala ClassTag, which is used to instantiate your class via reflection.
In your code, you don't specialize Dataset over any type T, so I cannot write the exact Encoder for you, but I can show you the example from the Databricks Spark documentation, which I suggest reading because it is great.
First of all, let's create the class University
that we want to load into a Dataset:
public class University implements Serializable {
    private String name;
    private long numStudents;
    private long yearFounded;

    public void setName(String name) { this.name = name; }
    public String getName() { return this.name; }
    public void setNumStudents(long numStudents) { this.numStudents = numStudents; }
    public long getNumStudents() { return this.numStudents; }
    public void setYearFounded(long yearFounded) { this.yearFounded = yearFounded; }
    public long getYearFounded() { return this.yearFounded; }
}
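As a side note, Encoders.bean derives the schema from the standard JavaBean getter/setter conventions of the class. Here is a minimal, Spark-free sketch of what that introspection sees, using the JDK's own Introspector (the BeanCheck class is illustrative only, not Spark's actual implementation):

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;

public class BeanCheck {
    // Same shape as the University bean above (illustrative copy).
    public static class University implements Serializable {
        private String name;
        private long numStudents;
        private long yearFounded;
        public void setName(String name) { this.name = name; }
        public String getName() { return this.name; }
        public void setNumStudents(long numStudents) { this.numStudents = numStudents; }
        public long getNumStudents() { return this.numStudents; }
        public void setYearFounded(long yearFounded) { this.yearFounded = yearFounded; }
        public long getYearFounded() { return this.yearFounded; }
    }

    public static void main(String[] args) throws IntrospectionException {
        // The Introspector applies the same getter/setter naming conventions
        // that Encoders.bean relies on to derive a schema for the type.
        BeanInfo info = Introspector.getBeanInfo(University.class, Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            System.out.println(pd.getName() + " : " + pd.getPropertyType().getSimpleName());
        }
    }
}
```

If a field lacks a matching getter/setter pair, it will not show up as a bean property, which is also why Encoders.bean would omit it from the schema.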
Now, University
is a Java Bean, and Spark's Encoders
class provides a way to create encoders for Java Beans with the function bean:
Encoder<University> universityEncoder = Encoders.bean(University.class);
which can then be used to read a Dataset of University
from parquet without first loading them into a DataFrame (which is redundant):
Dataset<University> schools = context.read().parquet(addressParquetPath).as(universityEncoder);
and now schools
is a Dataset<University>
read from a parquet file.
Upvotes: 3