Priyanka
Priyanka

Reputation: 51

How to find out only those column names where data is having particular value in scala

I have a DF dataframe which has below values

Email   NAME       DESIGNATION
MATCHED MISMATCHED MISMATCHED
MATCHED MISMATCHED MATCHED

I was able to find out the columns in DF dataframe using val comparecols = Df.columns

I want to find out those columns which has value MISMATCHED. Here in my example NAME and DESIGNATION. Since my dataframe is no consistent I am not sure how shall I use map and filter together

I have tried below

for (name <- comparecols){
df =df.filter(s"$name ='MATCHED'") 

But its giving me result row wise, i.e. its checking row at a time and giving all the columns.

However I want to check column wise.

Upvotes: 1

Views: 201

Answers (1)

mck
mck

Reputation: 42392

You can do a conditional sum/count:

val df2 = df.select(
    df.columns.map(
        c => sum(when(col(c) === "MISMATCHED", 1).otherwise(0)).as(c)
    ): _*
)

df2.show
+-----+----+-----------+
|Email|NAME|DESIGNATION|
+-----+----+-----------+
|    0|   2|          1|
+-----+----+-----------+

If you want to get the column names, you can do

val colnames = df2.select(
    arrays_zip(array(df.columns.map(lit):_*), array(df.columns.map(col):_*)).as("c")
).selectExpr("inline(c) as (col, val)").filter("val != 0").select("col").rdd.map(_.getString(0)).collect

Upvotes: 2

Related Questions