Nikhil Agrawal

Reputation: 26518

Reading Row data from Spark Dataset in Loop

I want to read Spark Dataset rows in a loop with Java, and inside that loop I have to read other datasets.

Suppose ds is the dataset. If I write the loop as below, I can read the other datasets:

ds.toJavaRDD().collect().forEach()

but if I remove collect() and toJavaRDD() and directly apply

ds.foreach()

then I am not able to read other datasets. How can I solve this?

Upvotes: 1

Views: 1653

Answers (1)

werner

Reputation: 14845

Reading a dataset (let's say from HDFS or a local file system) is an operation that is started from within the driver process. Any code running within an executor process cannot use the SparkSession; this API only lives on the driver.

The difference between ds.toJavaRDD().collect().forEach(myFunction) and ds.foreach(myFunction) is that in the first statement myFunction is executed within the driver process, while in the second one myFunction is executed within an executor process, where the Spark API cannot be used.

ds.toJavaRDD().collect() returns a plain Java List object; all data of the Spark dataset is moved to the driver. This list is a standard Java object that lives within the driver process, and forEach here is the method from the java.lang.Iterable interface.
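Here is a minimal sketch of that driver-side variant. It assumes a SparkSession named spark, Parquet sources, and a hypothetical column "path" in ds that points at the datasets to read per row; collectAsList() is equivalent to toJavaRDD().collect() for this purpose:

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DriverSideLoop {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("driver-side-loop")
                .getOrCreate();

        // Placeholder source for the outer dataset.
        Dataset<Row> ds = spark.read().parquet("/data/outer");

        // collectAsList() moves all rows of ds into the driver process.
        List<Row> rows = ds.collectAsList();

        // This loop runs on the driver, so the SparkSession is available
        // and other datasets can be read inside it.
        for (Row row : rows) {
            String path = row.getAs("path"); // assumed column name
            Dataset<Row> other = spark.read().parquet(path);
            System.out.println(path + " -> " + other.count());
        }

        spark.stop();
    }
}

Note that collecting only makes sense when ds is small enough to fit into the driver's memory.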

ds.foreach(), on the other hand, is a method of the Spark Dataset, and the function passed to it will be executed in parallel within the different Spark executors.
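For comparison, a sketch of the distributed variant (same assumed names as in the previous sketch); the lambda below is shipped to the executors, so the inner read cannot work there and typically fails at runtime:

import org.apache.spark.api.java.function.ForeachFunction;

// spark and ds defined as in the previous sketch
ds.foreach((ForeachFunction<Row>) row -> {
    // This function runs inside an executor process.
    // The SparkSession only lives on the driver, so calling
    // spark.read() here does not work.
    String path = row.getAs("path"); // assumed column name
    Dataset<Row> other = spark.read().parquet(path);
});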

Upvotes: 1
