Reputation: 75
I want to achieve the following with a Spark DataFrame: keep appending new rows to it, as shown in the example below.
for (a <- value) {
  val num = a
  val count = a + 10
  // creating a df with the above values
  val data = Seq((num.asInstanceOf[Double], count.asInstanceOf[Double]))
  val row = spark.sparkContext.parallelize(data).toDF("Number", "count")
  val data2 = data1.union(row)
  val data1 = data2 // --> currently this assignment is not possible
}
I have also tried
for (a <- value) {
  val num = a
  val count = a + 10
  // creating a df with the above values
  val data = Seq((num.asInstanceOf[Double], count.asInstanceOf[Double]))
  val row = spark.sparkContext.parallelize(data).toDF("Number", "count")
  val data1 = data1.union(row) // --> Union with self is not possible
}
How can I achieve this in Spark?
Upvotes: 0
Views: 1019
Reputation: 27383
Your data1 must be declared as a var:
var data1: DataFrame = ???
for (a <- value) {
  val num = a
  val count = a + 10
  // creating a df with the above values
  val data = Seq((num.toDouble, count.toDouble))
  val row = spark.sparkContext.parallelize(data).toDF("Number", "count")
  val data2 = data1.union(row)
  data1 = data2
}
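For reference, here is a minimal self-contained sketch of this pattern; the sample value collection and the empty-DataFrame initialization are assumptions (??? above is only a placeholder and throws NotImplementedError if evaluated, so data1 needs a real initial DataFrame):
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("append-rows").master("local[*]").getOrCreate()
import spark.implicits._

val value = Seq(1, 2, 55) // assumed input; adjust to your actual collection

// Initialize the var with an empty DataFrame carrying the target schema,
// so the first union has something valid to append to.
var data1: DataFrame = Seq.empty[(Double, Double)].toDF("Number", "count")

for (a <- value) {
  val row = Seq((a.toDouble, (a + 10).toDouble)).toDF("Number", "count")
  data1 = data1.union(row) // reassignment compiles because data1 is a var
}

data1.show()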
But I would not suggest doing this; it is better to convert your entire value (presumably a Seq?) to a DataFrame and then union once. Many unions tend to be inefficient:
import spark.implicits._ // required for toDF and the $"..." column syntax

val newDF = value.toDF("Number")
  .withColumn("count", $"Number" + 10)
val result = data1.union(newDF)
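Each union appends another node to the logical plan, so unioning inside a loop builds a deeper and deeper plan that Spark has to re-analyze; building all the rows first and unioning once keeps the plan flat.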
Upvotes: 0
Reputation: 143
DataFrames are immutable, so you will need to use a mutable structure. Here is a solution that might help you.
scala> val value = Array(1.0, 2.0, 55.0)
value: Array[Double] = Array(1.0, 2.0, 55.0)
scala> import scala.collection.mutable.ListBuffer
import scala.collection.mutable.ListBuffer
scala> var data = new ListBuffer[(Double, Double)]
data: scala.collection.mutable.ListBuffer[(Double, Double)] = ListBuffer()
scala> for(a <- value)
| {
| val num = a
| val count = a+10
| data += ((num, count)) // already Doubles, no cast needed
| println(data)
| }
ListBuffer((1.0,11.0))
ListBuffer((1.0,11.0), (2.0,12.0))
ListBuffer((1.0,11.0), (2.0,12.0), (55.0,65.0))
scala> val DF = spark.sparkContext.parallelize(data).toDF("Number","count")
DF: org.apache.spark.sql.DataFrame = [Number: double, count: double]
scala> DF.show()
+------+-----+
|Number|count|
+------+-----+
| 1.0| 11.0|
| 2.0| 12.0|
| 55.0| 65.0|
+------+-----+
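As a side note, since value is already an Array[Double], the loop and the mutable buffer can be collapsed into a single map; a sketch that works in spark-shell, where spark.implicits._ is already in scope:
val DF = value.toSeq.map(a => (a, a + 10)).toDF("Number", "count")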
Upvotes: 1
Reputation: 32720
Just create one DataFrame using the for-loop and then union it with data1, like this:
val df = (for (a <- value) yield (a, a + 10)).toDF("Number", "count")
val result = data1.union(df)
This would be much more efficient than doing unions inside the for-loop.
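As with the other answers, this assumes import spark.implicits._ is in scope so the yielded tuples can be converted with toDF, and that df has the same schema as data1 (union resolves columns by position).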
Upvotes: 0