Reputation: 25
I have a dynamically created Spark Dataframe where I need to filter the Dataframe when any of the columns are "False" and store it in one table and store the row where none of the columns are false in one table. Column names and number of columns will never be known.
For example, if my table is
Col1 Col2 Col3
Row1 True False False
Row2 True True True
Row3 False False True
Row4 False False False
Output should be Table 1:
Col1 Col2 Col3
Row1 True False False
Row3 False False True
Row4 False False False
and Table 2
Col1 Col2 Col3
Row2 True True True
I have tried:
val columns: Array[String] = testDF.columns
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'False'"))
val output: DataFrame = seqDfs.reduceRight(_ union _)
But it returns a lot of duplicate values and even if I clear the duplicate values, it wouldn't help me in creating table 2 as all rows in table 2 need to be true.
Any help will be appreciated. Thanks!
Upvotes: 0
Views: 1842
Reputation: 19348
Here's code to create your DataFrame:
val df = Seq(
(true, false, false),
(true, true, true),
(false, false, true),
(false, false, false)
).toDF("col1", "col2", "col3")
Let's append an all_true
column:
val columns = df.columns
val df2 = df.withColumn("all_true", columns.map(col(_).===(true)).reduceLeft(_.&&(_)))
Here's the DataFrame with all true values:
df2.where($"all_true" === true).drop("all_true").show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|true|true|true|
+----+----+----+
Here's the DataFrame with values that are not all true:
df2.where($"all_true" === false).drop("all_true").show()
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| true|false|false|
|false|false| true|
|false|false|false|
+-----+-----+-----+
Great question, welcome to StackOverflow ;)
BTW, spark-daria has a multiEquals()
function where I grabbed this code, see this file.
Upvotes: 2