smeeb

Reputation: 29477

Non-Empty Spark Dataset foreach not executing

I'm trying to read a Cassandra table (mykeyspace.mytable) from inside a Spark 2.1 job (using Scala 2.11):

val myDataset = sqlContext
     .read
     .format("org.apache.spark.sql.cassandra")
     .options(Map("table" -> "mytable", "keyspace" -> "mykeyspace"))
     .load()

myDataset.show()

println(s"Ping and the count is: ${myDataset.count}")
myDataset.foreach(t => println("Weee"))
println("Pong")

When this runs, the console output is:

+--------------+-----------+
|      username|modified_at|
+--------------+-----------+
|sluggoo-flibby|       null|
+--------------+-----------+

Ping and the count is: 1
Pong

So there's clearly a single record in this table... but why is my foreach loop "not working"? Why don't I see my "Weee" output?

Upvotes: 0

Views: 2226

Answers (2)

rogue-one

Reputation: 11577

The foreach operation doesn't run on your local machine; it runs on the remote machines where your Spark executors are running. Thus the println is executed not on your local machine but on the remote executors.
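
If you want to confirm that the foreach really does execute, a counter that is readable from the driver works well. Here is a minimal sketch using a Spark accumulator (the accumulator name "touched rows" is just illustrative):

// A counter registered on the driver but updated from the executors
val touched = sqlContext.sparkContext.longAccumulator("touched rows")

// Runs remotely; any println inside this closure goes to executor stdout
myDataset.foreach(_ => touched.add(1))

// Read back on the driver, proving the foreach did execute
println(s"Rows visited: ${touched.value}")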

To have the output printed on your local machine, you should collect the DataFrame, bringing all of its data to the driver (which runs on your local machine), and then execute a foreach on that local collection, as shown below.

myDataset.collect.foreach(println)

Note: be careful when using collect on an RDD or a DataFrame. collect downloads all the data from the distributed collection into local memory, which could lead to java.lang.OutOfMemoryError exceptions.
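
If only a bounded number of rows is needed on the driver, two standard alternatives keep memory use in check. A minimal sketch (the limit of 10 is arbitrary):

import scala.collection.JavaConverters._

// take(n) pulls at most n rows to the driver, so memory use stays bounded
myDataset.take(10).foreach(println)

// toLocalIterator streams the data back one partition at a time
myDataset.toLocalIterator.asScala.foreach(println)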

Upvotes: 1

Alper t. Turker

Reputation: 35219

I guess you don't see the output because println writes to the standard output of the worker, not the driver. This is a common mistake with RDDs (see View RDD contents in Python Spark?), but it also applies to Datasets.

You can collect, but of course this is not advised for large data:

myDataset.collect.foreach(t => println("Weee"))

Upvotes: 4
