Aris Kantas

Reputation: 497

How to convert a Dataframe into a List (Scala)?

I want to convert a DataFrame that contains Double values into a List so that I can use it for calculations. How can I get a List with the correct element type (i.e. Double)?

My approach is this:

val newList = myDataFrame.collect().toList

but it returns a List[org.apache.spark.sql.Row], and I don't know exactly what that type is or how to work with it.

Is it possible to skip that step and simply pass my DataFrame into a function and do the calculations from it? (For example, I want to compare the third element of its second column with a specific Double. Is it possible to do that directly from my DataFrame?)
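
Roughly what I have in mind is something like the sketch below (the value 7.0 is just a placeholder, and I'm assuming both columns are, or can be made, Double; I'm not sure this is the right way):

// sketch only: assumes both columns are (or have been cast to) Double
val values: List[(Double, Double)] =
  myDataFrame.collect().map(r => (r.getDouble(0), r.getDouble(1))).toList

// e.g. compare the third element of the second column with some Double
val isGreater = values(2)._2 > 7.0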

In any case, I need to understand how to build a List of the right type each time!

EDIT:

Input Dataframe:

+---+---+ 
|_c1|_c2|
+---+---+ 
|0  |0  | 
|8  |2  | 
|9  |1  | 
|2  |9  | 
|2  |4  | 
|4  |6  | 
|3  |5  | 
|5  |3  | 
|5  |9  | 
|0  |1  | 
|8  |9  | 
|1  |0  | 
|3  |4  |
|8  |7  | 
|4  |9  | 
|2  |5  | 
|1  |9  | 
|3  |6  |
+---+---+

Result after conversion:

List((0,0), (8,2), (9,1), (2,9), (2,4), (4,6), (3,5), (5,3), (5,9), (0,1), (8,9), (1,0), (3,4), (8,7), (4,9), (2,5), (1,9), (3,6))

But every element in the List has to be of type Double.

Upvotes: 0

Views: 10336

Answers (3)

koiralo

Reputation: 23109

You can cast the column you need to Double, convert the DataFrame to an RDD, and collect it.

If you have data that cannot be parsed, you can use a UDF to clean it up before casting it to Double:

import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.DoubleType
import scala.util.{Try, Success, Failure}

// parse a string to Double, falling back to NaN for unparseable values
val stringToDouble = udf((data: String) => {
  Try(data.toDouble) match {
    case Success(value) => value
    case Failure(_)     => Double.NaN
  }
})

// spark is the active SparkSession; its implicits are needed for toDF and $"..."
import spark.implicits._

val df = Seq(
  ("0.000", "0"),
  ("0.000008", "24"),
  ("9.00000", "1"),
  ("-2", "xyz"),
  ("2adsfas", "1.1.1")
).toDF("a", "b")
  .withColumn("a", stringToDouble($"a").cast(DoubleType))
  .withColumn("b", stringToDouble($"b").cast(DoubleType))

After this you will get the output:

+------+----+
|a     |b   |
+------+----+
|0.0   |0.0 |
|8.0E-6|24.0|
|9.0   |1.0 |
|-2.0  |NaN |
|NaN   |NaN |
+------+----+

To get Array[(Double, Double)]

val result = df.rdd.map(row => (row.getDouble(0), row.getDouble(1))).collect()

The result will be an Array[(Double, Double)].
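
If you need a List instead (for example to do the comparison described in the question), here is a small follow-up sketch; the index 2 and the threshold 7.0 are just example values:

val resultList: List[(Double, Double)] = result.toList

// e.g. compare the second-column value of the third row with some Double
val isGreater = resultList(2)._2 > 7.0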

Upvotes: 4

MIKHIL NAGARALE

Reputation: 193

Convert the DataFrame to a Dataset using a case class, and then convert that to a list.

It will return a list of your case class objects. All the fields of the class (mapping to the columns of your table) are already typed, so you won't need to cast every time.

Please execute the code below to check it. Sample to check and verify (Scala):

import spark.implicits._

case class wc(a: String, b: String, c: Int)

val wa = Array("one", "two", "two")
val wr = sc.parallelize(wa, 3).map(x => (x, "x", 1))
val wdf = wr.toDF("a", "b", "c")
val wds = wdf.as[wc]              // typed Dataset[wc]
val myList = wds.collect.toList   // List[wc]
myList.foreach(x => println(x))
myList.foreach(x => println((x.a.getClass, x.b.getClass, x.c.getClass)))

Upvotes: 0

Muhunthan

Reputation: 413

myDataFrame.select("_c1", "_c2").collect().map(each => (each.getAs[Double]("_c1"), each.getAs[Double]("_c2"))).toList
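
Note: getAs[Double] only works if _c1 and _c2 are already Double columns; if they were read as strings (e.g. from a CSV without a schema), it will throw a ClassCastException. A sketch of the same idea with an explicit cast first, assuming the column names _c1 and _c2 from the question:

import org.apache.spark.sql.functions.col

val doubles: List[(Double, Double)] = myDataFrame
  .select(col("_c1").cast("double"), col("_c2").cast("double"))
  .collect()
  .map(row => (row.getDouble(0), row.getDouble(1)))
  .toList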

Upvotes: -1
