Srinu Babu

Reputation: 422

How to convert Java ArrayList to Apache Spark Dataset?

I have a list like this:

List<String> dataList = new ArrayList<>();
dataList.add("A");
dataList.add("B");
dataList.add("C");

I need to convert it to a Dataset<Row>, something like this:

Dataset<Row> dataDs = Seq(dataList).toDs();

Upvotes: 4

Views: 18164

Answers (3)

Aayush Shah

Reputation: 520

This is a derived answer that worked for me, inspired by NiharGht's answer.

  • Suppose we have a list like this:
List<List<Integer>> data = Arrays.asList(
    Arrays.asList(1, 2, 3),
    Arrays.asList(2, 3, 4),
    Arrays.asList(3, 4, 5)
);
  • Now convert each List to a Row so that it can be used to build the DataFrame:
List<Row> rows = new ArrayList<>();
for (List<Integer> that_line : data){
    Row row = RowFactory.create(that_line.toArray());
    rows.add(row);
}
  • Then create the DataFrame directly from the List<Row> instead of going through an RDD:
Dataset<Row> r2DF = sparkSession.createDataFrame(rows, schema); // assumes a schema is already defined
r2DF.show();

The key is this line:

Dataset<Row> r2DF = sparkSession.createDataFrame(rows, schema);

This is where one would usually pass an RDD; here the List<Row> is passed directly.

Upvotes: 0

NiharGht

Reputation: 161

You can convert a List<String> to Dataset<Row> like so:

  1. Build a List<Object> from the List<String>, converting each element to its correct class (e.g. Integer, String).

  2. Generate a List<Row> from the List<Object>.

  3. Define the data types and headers (column names) you want for the Dataset<Row> schema.

  4. Construct the schema object.

  5. Create the dataset.

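import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// One row with two columns: a string and a (null) integer
List<Object> data = new ArrayList<>();
data.add("hello");
data.add(null);

List<Row> ls = new ArrayList<>();
Row row = RowFactory.create(data.toArray());
ls.add(row);

// Column types and column names, in the same order as the row values
List<DataType> datatype = new ArrayList<>();
datatype.add(DataTypes.StringType);
datatype.add(DataTypes.IntegerType);
List<String> headerList = new ArrayList<>();
headerList.add("Field_1_string");
headerList.add("Field_1_integer");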

StructField structField1 = new StructField(headerList.get(0), datatype.get(0), true, org.apache.spark.sql.types.Metadata.empty());

StructField structField2 = new StructField(headerList.get(1), datatype.get(1), true, org.apache.spark.sql.types.Metadata.empty());
List<StructField> structFieldsList = new ArrayList<>();
structFieldsList.add(structField1);
structFieldsList.add(structField2);

StructType schema = new StructType(structFieldsList.toArray(new StructField[0]));

Dataset<Row> dataset = sparkSession.createDataFrame(ls, schema);

dataset.show();
dataset.printSchema();
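
Step 1 is not shown in the code above, so here is a minimal sketch of it in plain Java (no Spark needed). The class name TypedValues, the method toTypedObjects, and the type names "string", "integer", "double" and "boolean" are all hypothetical, chosen only for illustration; adapt them to whatever types your schema uses.

```java
import java.util.ArrayList;
import java.util.List;

class TypedValues {
    // Hypothetical helper: converts each raw string to the Java class named
    // at the same position in typeNames. Null inputs stay null.
    static List<Object> toTypedObjects(List<String> raw, List<String> typeNames) {
        if (raw.size() != typeNames.size()) {
            throw new IllegalArgumentException("raw and typeNames must have the same size");
        }
        List<Object> typed = new ArrayList<>();
        for (int i = 0; i < raw.size(); i++) {
            String value = raw.get(i);
            if (value == null) {
                typed.add(null);
                continue;
            }
            String typeName = typeNames.get(i);
            switch (typeName) {
                case "string":
                    typed.add(value);
                    break;
                case "integer":
                    typed.add(Integer.valueOf(value.trim()));
                    break;
                case "double":
                    typed.add(Double.valueOf(value.trim()));
                    break;
                case "boolean":
                    typed.add(Boolean.valueOf(value.trim()));
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported type: " + typeName);
            }
        }
        return typed;
    }
}
```

The resulting list can then be passed to RowFactory.create(typed.toArray()) exactly as data is in the code above, provided the types line up with the DataTypes in the schema.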

Upvotes: 3

Roy

Reputation: 66

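List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> dataDs = spark.createDataset(data, Encoders.STRING());
// The same call works for the dataList from the question:
Dataset<String> dataListDs = spark.createDataset(dataList, Encoders.STRING());
dataDs.show();
// toDF() converts the Dataset<String> into the Dataset<Row> the question asks for
Dataset<Row> dataDf = dataListDs.toDF();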

Upvotes: 5
