Reputation: 26518
I want to read Spark dataset rows in a loop with Java, and inside the loop I have to read other datasets.
Suppose ds is the dataset. If I write the loop as below, I can read the other datasets:
ds.toJavaRDD().collect().forEach()
But if I remove collect() and toJavaRDD() and directly apply
ds.foreach()
then I am not able to read the other datasets. How can I solve this?
Upvotes: 1
Views: 1653
Reputation: 14845
Reading a dataset (let's say from HDFS or a local file system) is an operation that is started from within the driver process. Code running within an executor process cannot use the SparkSession; that API only lives on the driver.
The difference between ds.toJavaRDD().collect().forEach(myFunction) and ds.foreach(myFunction) is that in the first statement myFunction is executed within the driver process, while in the second one myFunction is executed within an executor process, where the Spark API cannot be used.
ds.toJavaRDD().collect() returns a plain Java List object; all data of the Spark dataset is moved to the driver. This list is a standard Java object that lives within the driver process, and forEach here is the default method declared on the java.lang.Iterable interface.
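To see that forEach after collect() is ordinary Java iteration, here is a plain-Java sketch with no Spark at all (the list contents and class name are illustrative stand-ins for the collected rows): the loop body runs in whatever JVM holds the list, which in Spark's case is the driver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DriverSideLoop {
    // Hypothetical stand-in for ds.toJavaRDD().collect(): just a java.util.List
    static List<String> collectedRows() {
        return Arrays.asList("row1", "row2", "row3");
    }

    public static void main(String[] args) {
        List<String> rows = collectedRows();
        List<String> processed = new ArrayList<>();
        // forEach here is java.lang.Iterable#forEach, running in this JVM
        rows.forEach(row -> processed.add(row.toUpperCase()));
        System.out.println(processed); // [ROW1, ROW2, ROW3]
    }
}
```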
ds.foreach(), on the other hand, is a method of the Spark Dataset, and the function passed to it is executed in parallel within the different Spark executors.
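A minimal sketch of the working pattern (the paths and the assumption that the first column holds a path are illustrative, not from the question): collect the outer dataset to the driver, loop over the resulting plain Java list, and only issue spark.read calls from inside that driver-side loop.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DriverSideReads {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("driver-side-reads")
                .getOrCreate();

        // Hypothetical outer dataset; replace with your own source.
        Dataset<Row> ds = spark.read().parquet("/data/outer");

        // collect() moves all rows to the driver; fine for small datasets,
        // but it can exhaust driver memory for large ones.
        for (Row row : ds.toJavaRDD().collect()) {
            String path = row.getString(0); // assumed: first column holds a path

            // This runs on the driver, so using the SparkSession is legal here.
            Dataset<Row> other = spark.read().parquet(path);
            other.show();
        }

        // By contrast, ds.foreach(row -> spark.read().parquet(...)) fails:
        // the lambda runs on executors, where no SparkSession is available.

        spark.stop();
    }
}
```

If the per-row reads are really lookups against the same datasets, a join between ds and those datasets is usually the more scalable alternative.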
Upvotes: 1