Bîrsan Octav

Reputation: 69

Apache Spark having Dataset of a parameterised/generic class in Java

I've always wondered whether having a Dataset of a parameterised/generic class is possible in Java. To be more clear, what I am looking to achieve is something like this:

Dataset<MyClass<Integer>> myClassInteger;
Dataset<MyClass<String>> myClassString;

Please let me know if this is possible. If you could also show me how to achieve this, I would be very appreciative. Thanks!

Upvotes: 2

Views: 842

Answers (2)

Hyo Byun

Reputation: 1276

Sorry this question is old, but I wanted to put down some notes, since I was able to work with generic/parameterized classes for Datasets in Java. The approach was to create a generic class that takes a type parameter and to put the methods inside that parameterized class, i.e. class MyClassProcessor<T1>, where T1 could be Integer or String.

Unfortunately, you will not enjoy the full benefits of generic types in this case, and you will have to work around a few issues:

  • I had to use Encoders.kryo(); otherwise the generic types became Object in some operations and could not be cast back to the generic type.
    • This introduces some other annoyances, e.g. you can't join directly. I had to use tricks like wrapping values in Tuples to allow some join operations.
  • I haven't tried reading generic types from a source; my parameterized classes were introduced later using map. For example, I read TypeA and later worked with Dataset<MyClass>.
  • I was able to use more complex, custom types in the generics, not just Integer, String, etc.
  • There were some annoying details, like having to pass along Class literals (e.g. TypeA.class) and using raw types for certain map functions.
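The workarounds above can be sketched roughly like this (a minimal, untested sketch; MyClass, its payload field, and the surrounding class name are hypothetical, and the unchecked cast on the class literal is the raw-type trick from the last bullet):

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;
import java.util.Arrays;

public class GenericDatasetSketch {
    // Hypothetical generic payload class
    public static class MyClass<T> implements Serializable {
        private final T payload;
        public MyClass(T payload) { this.payload = payload; }
        public T getPayload() { return payload; }
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("GenericDatasetSketch")
                .master("local[2]")
                .getOrCreate();

        // Encoders.bean() cannot describe MyClass<Integer>, so fall back to
        // Kryo serialization. The unchecked cast through the raw class
        // literal is required because MyClass.class has no type argument.
        Encoder<MyClass<Integer>> encoder =
                Encoders.kryo((Class<MyClass<Integer>>) (Class<?>) MyClass.class);

        Dataset<MyClass<Integer>> ds = spark.createDataset(
                Arrays.asList(new MyClass<>(1), new MyClass<>(2)), encoder);

        // Mapping back out to a concrete type works normally.
        Dataset<Integer> payloads = ds.map(
                (MapFunction<MyClass<Integer>, Integer>) m -> m.getPayload() + 1,
                Encoders.INT());
        payloads.show();

        spark.stop();
    }
}
```

Note that with a Kryo encoder the Dataset is stored as a single binary column, which is why column-based operations like joins need the Tuple tricks mentioned above.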

Upvotes: 1

Ajay Kr Choudhary

Reputation: 1352

Yes, you can have a Dataset of your own class. It would look like Dataset<MyOwnClass>.

In the code below I have tried to read a file's contents and put them in a Dataset of the class that we have created. Please check the snippet below.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;

public class FileDataset {
    public static class Employee implements Serializable {
        private int key;
        private int value;

        // Encoders.bean() relies on JavaBean getter/setter pairs
        // to infer the schema, so plain fields are not enough.
        public int getKey() { return key; }
        public void setKey(int key) { this.key = key; }
        public int getValue() { return value; }
        public void setValue(int value) { this.value = value; }
    }

    public static void main(String[] args) {
        // configure spark
        SparkSession spark = SparkSession
                .builder()
                .appName("Reading JSON File into DataSet")
                .master("local[2]")
                .getOrCreate();

        final Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);

        final String jsonPath = "/Users/ajaychoudhary/Documents/student.txt";

        // read JSON file to Dataset
        Dataset<Employee> ds = spark.read()
                .json(jsonPath)
                .as(employeeEncoder);
        ds.show();
    }
}

The content of my student.txt file is:

{ "key": 1, "value": 2 }
{ "key": 3, "value": 4 }
{ "key": 5, "value": 6 }

It produces the following output on the console:

+---+-----+
|key|value|
+---+-----+
|  1|    2|
|  3|    4|
|  5|    6|
+---+-----+

I hope this gives you an initial idea of how you can have the dataset of your own custom class.

Upvotes: -1
