gaurav5430

Reputation: 13902

Spark: Creating Object RDD from List&lt;Object&gt; RDD

Assume Employee is a Java Class.

I have a JavaRDD<Employee[]> arrayOfEmpList, i.e., each element of the RDD is an array of employees.

From this, I want to create a single RDD of employees, something like

JavaRDD<Employee>

This is what I tried. First I created a list:

List<Employee> empList = new ArrayList<Employee>();

Then, for each Employee[] element of the RDD:

arrayOfEmpList.foreach(new VoidFunction<Employee[]>() {
    public void call(Employee[] arg0) {
        empList.addAll(Arrays.asList(arg0));
        System.out.println(empList.size()); // prints correct values incrementally
    }
});

System.out.println(empList.size()); //gives 0

I am not able to get the size outside the foreach call.

Is there some other way to achieve this?

P.S.: I want all employee records as separate elements of one RDD. The 1st employee array may contain 10 records, the 2nd 100 records, and the 3rd 200 records; I want a final list of 310 records, which I can then parallelize and perform actions upon.

Upvotes: 0

Views: 3851

Answers (1)

ernest_k

Reputation: 45319

What you need is the flatMap transformation. Your foreach approach cannot work: the function runs on the executors against a serialized copy of empList, so additions never reach the driver's list. With flatMap, each employee array is converted into a list whose elements are flattened into the resulting RDD:

JavaRDD<Employee> employeeRDD = arrayOfEmpList.flatMap(empArray -> Arrays.asList(empArray).iterator());

(In Spark 2.x, the FlatMapFunction must return an Iterator, hence the .iterator() call; in Spark 1.x, returning the list itself, an Iterable, was enough.)

There may also be an overload that takes an array directly rather than a collection; check the JavaDocs linked below.
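
For completeness, a minimal self-contained sketch, assuming Spark 2.x. Everything except arrayOfEmpList and the flatMap call is illustrative (the Employee class with just a name field, the class and variable names, the local master):

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FlattenEmployees {

    // Hypothetical Employee class; it must be Serializable so Spark can ship it to executors.
    public static class Employee implements Serializable {
        final String name;
        Employee(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("flatten-employees").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each element of this RDD is a whole Employee[] array.
        JavaRDD<Employee[]> arrayOfEmpList = sc.parallelize(Arrays.asList(
                new Employee[] { new Employee("a"), new Employee("b") },
                new Employee[] { new Employee("c") }));

        // flatMap unrolls each array into individual Employee elements;
        // Spark 2.x expects the function to return an Iterator.
        JavaRDD<Employee> employeeRDD =
                arrayOfEmpList.flatMap(empArray -> Arrays.asList(empArray).iterator());

        System.out.println(employeeRDD.count()); // 3: one element per employee across both arrays

        sc.stop();
    }
}

The resulting employeeRDD is already distributed, so there is no need to collect into a list and re-parallelize.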

You can see this in the transformations section of the programming guide: http://spark.apache.org/docs/latest/programming-guide.html#transformations

JavaDocs: http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDDLike.html#flatMap(org.apache.spark.api.java.function.FlatMapFunction)

Upvotes: 1
