rtcode

Reputation: 103

Spark (Scala): How to turn an Array[Row] into either a Dataset[Row] or a DataFrame?

I have an Array[Row] and I want to turn it into either a Dataset[Row] or a DataFrame.

How did I come up with an Array of Rows?

Well, I was trying to clear nulls from my dataset, so I came up with the following line to filter out nulls in all columns:

val outDF = inputDF.columns.flatMap { col => inputDF.filter(col + "!='' AND " + col + "!='null'").collect() }

Problem is, outDF ends up as an Array[Row] (each collect() returns an Array[Row], and flatMap concatenates them), hence the question. Any ideas welcome!
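For completeness, I gather the direct conversion back would be something like this (untested sketch; assumes a SparkSession named spark and that the collected rows still match inputDF's schema):

// Array[Row] -> RDD[Row] -> DataFrame, reusing the original schema
val rowsAsDF = spark.createDataFrame(
  spark.sparkContext.parallelize(outDF),
  inputDF.schema
)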

Upvotes: 1

Views: 868

Answers (3)

mrsrinivas

Reputation: 35404

I'm posting the answer as per my comment.

df.na.drop(df.columns).where("'null' not in ("+df.columns.mkString(",")+")")
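For illustration, a minimal sketch of how this behaves on toy data (the column names and values below are made up):

import spark.implicits._

// "null" here is the literal string, not a SQL NULL
val df = Seq(("1", "a"), ("null", "b"), ("2", "c")).toDF("_1", "_2")

// na.drop removes rows with real nulls in the listed columns;
// the where clause then drops rows where any column equals the string 'null'
df.na.drop(df.columns)
  .where("'null' not in (" + df.columns.mkString(",") + ")")
  .show()
// only ("1", "a") and ("2", "c") survive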

Upvotes: 3

rtcode

Reputation: 103

This was answered using the following code, based on mrsrinivas's comment:

// First, drop rows containing actual (SQL) nulls
val prelimDF = inputDF.na.drop()

// Then drop rows where any column holds the literal string 'null'
val finalDF = prelimDF.na.drop(prelimDF.columns).where("'null' not in ("+prelimDF.columns.mkString(",")+")")

Cheers!

Upvotes: 0

zero323

Reputation: 330063

This is what your code would do if it worked:

// One filtered copy of the data per column, then everything unioned:
// a row matching the predicate for several columns shows up several times
inputDF.columns.map {
  col => inputDF.filter((inputDF(col) =!= "") and (inputDF(col) =!= "null"))
}.reduce(_ union _)

and something like this:

import org.apache.spark.sql.functions.lit  // already in scope in spark-shell

// Build one predicate per column and AND them all into a single filter
inputDF.where(inputDF.columns.map {
  col => (inputDF(col) =!= "") and (inputDF(col) =!= "null")
}.foldLeft(lit(true))(_ and _))

is what you want.

Note that the first solution creates non-exclusive subsets, so with data like this:

val inputDF = Seq(("1", "a"), ("2", ""), ("null", "")).toDF

the result would be:

+---+---+
| _1| _2|
+---+---+
|  1|  a|
|  2|   |
|  1|  a|
+---+---+

With the solution I believe to be correct, the result would be:

+---+---+
| _1| _2|
+---+---+
|  1|  a|
+---+---+

Upvotes: 3
