Reputation: 422
I have a list like this:
List<String> dataList = new ArrayList<>();
dataList.add("A");
dataList.add("B");
dataList.add("C");
I need to convert it to a Dataset<Row>, i.e. the Java equivalent of Dataset<Row> dataDs = Seq(dataList).toDS();
Upvotes: 4
Views: 18164
Reputation: 520
This is the answer I derived and that worked for me; it is inspired by NiharGht's answer.
List<List<Integer>> data = Arrays.asList(
        Arrays.asList(1, 2, 3),
        Arrays.asList(2, 3, 4),
        Arrays.asList(3, 4, 5)
);
List<Row> rows = new ArrayList<>();
for (List<Integer> that_line : data){
Row row = RowFactory.create(that_line.toArray());
rows.add(row);
}
Dataset<Row> r2DF = sparkSession.createDataFrame(rows, schema); // supposing you have schema already.
r2DF.show();
The catch is in this line:
Dataset<Row> r2DF = sparkSession.createDataFrame(rows, schema);
This is the call where an RDD is usually passed, but it also accepts a plain List<Row> directly.
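For comparison, here is a minimal sketch of the two createDataFrame overloads involved, assuming the sparkSession, rows, and schema variables from the snippet above (the new variable names are just for illustration):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Overload 1: build the DataFrame directly from a java.util.List<Row>.
Dataset<Row> fromList = sparkSession.createDataFrame(rows, schema);

// Overload 2: the variant that is usually shown, which goes through a JavaRDD<Row>.
JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext());
JavaRDD<Row> rowRdd = jsc.parallelize(rows);
Dataset<Row> fromRdd = sparkSession.createDataFrame(rowRdd, schema);
Both calls produce the same DataFrame; the List overload simply skips the detour through an RDD.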
Upvotes: 0
Reputation: 161
You can convert a List<String> to a Dataset<Row> like so:
1. Get a List<Object> from the List<String>, with the correct Object class for each element (e.g. Integer, String, etc.).
2. Generate a List<Row> from the List<Object>.
3. Get the datatypeList and headerList you want for the Dataset<Row> schema.
4. Construct the schema object.
5. Create the dataset.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Steps 1 and 2: build the values as a List<Object> and wrap them in a Row
List<Object> data = new ArrayList<>();
data.add("hello");
data.add(null);
List<Row> ls = new ArrayList<>();
Row row = RowFactory.create(data.toArray());
ls.add(row);

// Step 3: data types and column names for the schema
List<DataType> datatype = new ArrayList<>();
datatype.add(DataTypes.StringType);
datatype.add(DataTypes.IntegerType);
List<String> headerList = new ArrayList<>();
headerList.add("Field_1_string");
headerList.add("Field_1_integer");

// Step 4: construct the schema
StructField structField1 = new StructField(headerList.get(0), datatype.get(0), true, Metadata.empty());
StructField structField2 = new StructField(headerList.get(1), datatype.get(1), true, Metadata.empty());
List<StructField> structFieldsList = new ArrayList<>();
structFieldsList.add(structField1);
structFieldsList.add(structField2);
StructType schema = new StructType(structFieldsList.toArray(new StructField[0]));
Dataset<Row> dataset = sparkSession.createDataFrame(ls, schema);
dataset.show();
dataset.printSchema();
Upvotes: 3
Reputation: 66
List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> dataDs = spark.createDataset(data, Encoders.STRING());
Dataset<String> dataListDs = spark.createDataset(dataList, Encoders.STRING());
dataDs.show();
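If you specifically need a Dataset<Row> (as the question asks) rather than a Dataset<String>, one option, sketched here with a column name of my own choosing ("value"), is to convert it with toDF:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Dataset<String> -> Dataset<Row>; the column name "value" is an arbitrary choice.
Dataset<Row> dataDf = dataDs.toDF("value");
dataDf.show();
dataDf.printSchema();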
Upvotes: 5