Reputation: 103
I have an Array[Row] and I want to turn it into either a Dataset[Row] or a DataFrame.
How did I come up with an Array of Rows?
Well, I was trying to clear nulls from my dataset. I could not rely on the .na.drop() function from DataFrameNaFunctions because it fails to detect when a cell actually contains the string "null". So I came up with the following line to filter out null in all columns.
val outDF = inputDF.columns.flatMap { col => inputDF.filter(col + "!='' AND " + col + "!='null'").collect() }
The problem is that outDF is an Array[Row], hence the question! Any ideas welcome!
Upvotes: 1
Views: 868
Reputation: 35404
I'm posting the answer as per my comment.
df.na.drop(df.columns).where("'null' not in ("+df.columns.mkString(",")+")")
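To make it easier to see what that where clause evaluates, here is a small sketch in plain Scala (no Spark needed) that builds the same predicate string. The object name, the helper name, and the column names `name` and `city` are made up for illustration and are not from the question.

```scala
object PredicateSketch {
  // Builds the same SQL predicate string that the one-liner passes to where().
  def nullStringPredicate(columns: Seq[String]): String =
    "'null' not in (" + columns.mkString(",") + ")"

  def main(args: Array[String]): Unit = {
    // Hypothetical column names, chosen only for this example.
    println(nullStringPredicate(Seq("name", "city")))
    // prints: 'null' not in (name,city)
  }
}
```

So a row is kept only when none of the listed columns equals the literal string 'null'. Column names containing spaces or other special characters would probably need backtick quoting before being joined like this.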
Upvotes: 3
Reputation: 103
This was answered by using the following code, based on Mr Srinivas's comment:
//First drop all typical nulls
val prelimDF = inputDF.na.drop()
//Then drop all rows where any column holds the string 'null'
val finalDF = prelimDF.na.drop(prelimDF.columns).where("'null' not in ("+prelimDF.columns.mkString(",")+")")
Cheers!
Upvotes: 0
Reputation: 330063
This is what your code would do if it worked:
inputDF.columns.map {
col => inputDF.filter((inputDF(col) =!= "") and (inputDF(col) =!= "null"))
}.reduce(_ union _)
and something like this:
import org.apache.spark.sql.functions.lit
inputDF.where(inputDF.columns.map {
col => (inputDF(col) =!= "") and (inputDF(col) =!= "null")
}.foldLeft(lit(true))(_ and _))
is what you want.
Note that the first solution creates non-exclusive subsets (a row that passes the filter for several columns appears once per column), so with data like this:
val inputDF = Seq(("1", "a"), ("2", ""), ("null", "")).toDF
the result would be:
+---+---+
| _1| _2|
+---+---+
|  1|  a|
|  2|   |
|  1|  a|
+---+---+
For the solution I believe to be correct:
+---+---+
| _1| _2|
+---+---+
|  1|  a|
+---+---+
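As for the literal question of getting from an Array[Row] back to a DataFrame, a minimal sketch follows. It assumes a local SparkSession and that the collected rows still match the schema of the DataFrame they came from. The object name `ArrayRowToDataFrame` and the helper `toDataFrame` are hypothetical names used only for this example.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

object ArrayRowToDataFrame {
  // Pairs already-collected rows with a schema to rebuild a DataFrame.
  // Assumes every Row conforms to the given schema.
  def toDataFrame(spark: SparkSession, rows: Array[Row], schema: StructType): DataFrame =
    spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), schema)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("array-row-sketch")
      .getOrCreate()
    import spark.implicits._

    val inputDF = Seq(("1", "a"), ("2", ""), ("null", "")).toDF

    // collect() pulls every row to the driver as an Array[Row].
    val rows: Array[Row] = inputDF.collect()

    val rebuilt = toDataFrame(spark, rows, inputDF.schema)
    rebuilt.show()

    spark.stop()
  }
}
```

This round trip moves all the data through the driver, so filtering with where directly on the DataFrame, as above, is usually the better choice.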
Upvotes: 3