cryp

Reputation: 2385

How to get a value from Dataset and store it in a Scala value?

I have a dataframe which looks like this:

scala> avgsessiontime.show()
+-----------------+
|              avg|
+-----------------+
|2.073455735838315|
+-----------------+

I need to store the value 2.073455735838315 in a variable. I tried using

avgsessiontime.collect 

but that starts giving me Task not serializable exceptions. To avoid that I started using foreachPartition, but I don't know how to extract the value 2.073455735838315 into an array variable.

scala> avgsessiontime.foreachPartition(x => x.foreach(println))
[2.073455735838315]

But when I do this:

avgsessiontime.foreachPartition(x => for (name <- x) name.get(0))

I get an empty result. Even asking for the length returns nothing:

avgsessiontime.foreachPartition(x => for (name <- x) name.length)

I know name is of type org.apache.spark.sql.Row, so I would expect both of those calls to return a result.

Upvotes: 0

Views: 9706

Answers (3)

Mahesh Chand

Reputation: 3250

scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
scala> val variable = df.select("id").as[Long].collect
variable: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

You can extract values of any other type (Double, String, and so on) the same way. You just need to specify the data type with as[...] when selecting the values from df.
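Applied to the single-row DataFrame from the question, the same idea might look like the sketch below. It assumes a throwaway local SparkSession (in spark-shell, `spark` already exists), and the DataFrame is rebuilt by hand from the value shown in the question; the object name and app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object TypedCollectSketch {
  // Assumption: a local session created only for this illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("typed-collect-sketch")
    .getOrCreate()

  import spark.implicits._

  // Stand-in for the asker's avgsessiontime DataFrame.
  val avgsessiontime = Seq(2.073455735838315).toDF("avg")

  // as[Double] turns the single column into a Dataset[Double],
  // so collect() returns an Array[Double] on the driver.
  val values: Array[Double] = avgsessiontime.select("avg").as[Double].collect()

  // The single value, taken from the collected array.
  val avg: Double = values.head
}
```

Because the result is already typed as Double, no cast from Row or Any is needed afterwards.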

Upvotes: 2

Ramesh Maharjan

Reputation: 41987

RDDs and DataFrames/Datasets are distributed, and foreach and foreachPartition run on the executors: they process the rows there and return nothing (Unit) to the driver. So if you want the value back on the driver node, you have to use collect.
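There is also a plain Scala reason why `for (name <- x) name.get(0)` in the question shows nothing: a for loop without yield evaluates its body and discards the result. The sketch below illustrates this without Spark; the iterator of `Seq[Double]` is a hypothetical stand-in for the rows of one partition, not a real `Row` iterator.

```scala
object ForYieldSketch {
  // Hypothetical stand-in for the rows inside one partition:
  // each "row" is a Seq holding a single Double.
  def rows: Iterator[Seq[Double]] =
    Iterator(Seq(2.073455735838315), Seq(2.073455735838316))

  // A for loop without yield desugars to foreach: row(0) is evaluated
  // and thrown away, so the whole expression has type Unit.
  val discarded: Unit = for (row <- rows) row(0)

  // With yield the loop desugars to map, so the values are kept.
  val kept: List[Double] = (for (row <- rows) yield row(0)).toList
}
```

Note that adding yield inside foreachPartition would still not bring the values back to the driver, since foreachPartition itself returns Unit; that is why an action such as collect is needed.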

Supposing you have a dataframe as

+-----------------+
|avg              |
+-----------------+
|2.073455735838315|
|2.073455735838316|
+-----------------+

Doing the following will print all the values, which you can also store in a variable:

avgsessiontime.rdd.collect().foreach(x => println(x(0)))

It will print

2.073455735838315
2.073455735838316

Now if you want only the first one, you can do

avgsessiontime.rdd.collect()(0)(0)

which will give you the following (typed as Any, since indexing into a Row returns Any)

2.073455735838315

I hope the answer is helpful

Upvotes: 1

akuiper

Reputation: 215107

You might need:

avgsessiontime.first.getDouble(0)

Here, first extracts the Row object and .getDouble(0) extracts the value from that Row.


val df = Seq(2.0743).toDF("avg")

df.show
+------+
|   avg|
+------+
|2.0743|
+------+

df.first.getDouble(0)
// res6: Double = 2.0743
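
If you would rather look the column up by name, or the DataFrame might be empty, a slightly more defensive variant could look like the sketch below. It assumes a throwaway local SparkSession; the object name, app name and the empty DataFrame are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object FirstValueSketch {
  // Assumption: a local session created only for this illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("first-value-sketch")
    .getOrCreate()

  import spark.implicits._

  val df = Seq(2.0743).toDF("avg")

  // Look the column up by name instead of by position.
  val byName: Double = df.first().getAs[Double]("avg")

  // take(1) returns an Array[Row] with at most one element, so
  // headOption yields None for an empty DataFrame instead of failing.
  val maybeAvg: Option[Double] = df.take(1).headOption.map(_.getDouble(0))

  // Hypothetical empty DataFrame with the same column, to show the None case.
  val emptyDf = Seq.empty[Double].toDF("avg")
  val maybeNone: Option[Double] = emptyDf.take(1).headOption.map(_.getDouble(0))
}
```

Returning an Option leaves it to the caller to decide what an empty result should mean, rather than relying on first succeeding.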

Upvotes: 3
