Amitabh Ranjan

Reputation: 1500

How to convert List to JavaRDD

We know that in Spark there is a method, rdd.collect(), which converts an RDD to a List:

List<String> f = rdd.collect();
String[] array = f.toArray(new String[f.size()]);

I am trying to do exactly the opposite in my project. I have an ArrayList of String that I want to convert to a JavaRDD. I have been looking for a solution for quite some time but have not found an answer. Can anybody please help me out here?

Upvotes: 36

Views: 59445

Answers (5)

malvadao

Reputation: 3462

If you are working in a .scala file, or you cannot or do not want to use JavaSparkContext, you can:

  1. use SparkContext instead of JavaSparkContext
  2. convert your Java List to a Scala List
  3. use SparkContext's parallelize method

For example:

import scala.collection.JavaConverters._  // provides asScala

val javaList = new java.util.ArrayList[String]()
javaList.add("abc")
javaList.add("def")
// asScala wraps the Java list as a Scala Buffer, which parallelize accepts
sc.parallelize(javaList.asScala)

This will generate an RDD for you.

Upvotes: 0

mrsrinivas

Reputation: 35444

Adding to Sean Owen's and the other solutions:

You can use JavaSparkContext#parallelizePairs for a List of Tuple2:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

List<Tuple2<Integer, Integer>> pairs = new ArrayList<>();
pairs.add(new Tuple2<>(0, 5));
pairs.add(new Tuple2<>(1, 3));

JavaSparkContext sc = new JavaSparkContext();

// parallelizePairs builds a JavaPairRDD directly from the list of tuples
JavaPairRDD<Integer, Integer> rdd = sc.parallelizePairs(pairs);

Upvotes: 6

Abhash Kumar

Reputation: 1228

There are two ways to convert a collection to an RDD:

1) sc.parallelize(collection)
2) sc.makeRDD(collection)

Both methods behave identically (makeRDD simply delegates to parallelize), so you can use either one.
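A minimal Java sketch of the parallelize route, assuming a JavaSparkContext named jsc (makeRDD is exposed on the Scala SparkContext rather than on JavaSparkContext); the master and app name are illustrative placeholders:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical local context purely for illustration; reuse an existing one in a real job.
JavaSparkContext jsc = new JavaSparkContext("local[*]", "collection-to-rdd");

List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

// Distributes the local collection across the cluster as an RDD.
JavaRDD<Integer> rdd = jsc.parallelize(numbers);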

Upvotes: 4

Mantas

Reputation: 65

List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("fieldx1", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("fieldx2", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("fieldx3", DataTypes.LongType, true));

// Build the schema from the field list.
StructType schema = DataTypes.createStructType(fields);

List<Row> data = new ArrayList<>();
// The third value must be a Long to match LongType.
data.add(RowFactory.create("", "", 0L));

Dataset<Row> rawDataSet = spark.createDataFrame(data, schema);
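This builds a Dataset rather than a JavaRDD; if a JavaRDD is still needed (as in the original question), the Dataset can be converted back. A minimal sketch, assuming the rawDataSet variable from the snippet above:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;

// toJavaRDD exposes the Dataset's rows as a JavaRDD<Row>.
JavaRDD<Row> rowRdd = rawDataSet.toJavaRDD();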

Upvotes: -3

Sean Owen

Reputation: 66891

You're looking for JavaSparkContext.parallelize(List) and similar. This is just like in the Scala API.
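A minimal sketch of that call; the master and app name here are illustrative placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative local context; in an existing application reuse the JavaSparkContext you already have.
JavaSparkContext sc = new JavaSparkContext("local[*]", "list-to-javardd");

List<String> list = new ArrayList<>();
list.add("abc");
list.add("def");

// parallelize distributes the in-memory list as a JavaRDD<String>.
JavaRDD<String> rdd = sc.parallelize(list);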

Upvotes: 58
