Reputation: 29477
I'm trying to read a Cassandra table (mykeyspace.mytable) from inside a Spark 2.1 job (using Scala 2.11):
val myDataset = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "mykeyspace"))
.load()
myDataset.show()
println(s"Ping and the count is: ${myDataset.count}")
myDataset.foreach(t => println("Weee"))
println("Pong")
When this runs, the console output is:
+--------------+-----------+
| username|modified_at|
+--------------+-----------+
|sluggoo-flibby| null|
+--------------+-----------+
Ping and the count is: 1
Pong
So there's clearly a single record in this table... but why is my foreach loop "not working"? Why don't I see my "Weee" output?
Upvotes: 0
Views: 2226
Reputation: 11577
The foreach operation doesn't run on your local machine; it runs on the remote machines where your Spark executors are running. So the println is executed not on your local machine but on the remote executors.
To have it printed on your local machine, you should collect on the DataFrame, which brings all of the DataFrame's data to your driver (which runs on your local machine), and then execute foreach on that local collection, as shown below.
myDataset.collect.foreach(println)
Note: be careful when using collect on an RDD or a DataFrame. collect downloads all the data from the distributed collection into local driver memory, which could lead to java.lang.OutOfMemoryError exceptions.
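If the table might be large, a bounded alternative is to pull only a fixed number of rows to the driver with take (a minimal sketch; the limit of 20 rows here is an arbitrary assumption):
// Pull at most 20 rows to the driver instead of the whole table;
// take(n) is safe on the driver even when the distributed data is huge.
myDataset.take(20).foreach(println)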
Upvotes: 1
Reputation: 35219
I guess you don't see the output because println outputs to the standard output of the worker, not the driver. This is a common mistake with RDDs (see View RDD contents in Python Spark?), but it also applies to Datasets.
You can collect, but of course that is not advised for large data:
myDataset.collect.foreach(t => println("Weee"))
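As an aside, the original foreach really does run; its output just lands in the executors' stdout logs (visible through the Spark UI) rather than in the driver console. A minimal sketch, assuming the SparkSession is available as spark, that confirms this from the driver using an accumulator:
// The accumulator is updated on the executors inside foreach,
// but its merged value can be read back on the driver.
val processed = spark.sparkContext.longAccumulator("processed")
myDataset.foreach(_ => processed.add(1)) // runs on the executors
println(s"Rows touched on executors: ${processed.value}") // prints on the driver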
Upvotes: 4