user7394882

Reputation: 193

Drop all columns that satisfy a condition on the column, in Spark

I have a dataset and I need to drop the columns whose standard deviation is equal to 0. I've tried:

val df = spark.read.option("header",true)
  .option("inferSchema", "false").csv("C:/gg.csv")

val aggs = df.columns.map(stddev(_))
val finalresult = df.agg(aggs.head, aggs.tail: _*)

I want to compute the standard deviation of each column and drop the column if it is equal to zero.

RowNumber,Poids,Age,Taille,0MI,Hmean,CoocParam,LdpParam,Test2,Classe,
0,87,72,160,5,0.6993,2.9421,2.3745,3,4,
1,54,70,163,5,0.6301,2.7273,2.2205,3,4,
2,72,51,164,5,0.6551,2.9834,2.3993,3,4,
3,75,74,170,5,0.6966,2.9654,2.3699,3,4,
4,108,62,165,5,0.6087,2.7093,2.1619,3,4,
5,84,61,159,5,0.6876,2.938,2.3601,3,4,
6,89,64,168,5,0.6757,2.9547,2.3676,3,4,
7,75,72,160,5,0.7432,2.9331,2.3339,3,4,
8,64,62,153,5,0.6505,2.7676,2.2255,3,4,
9,82,58,159,5,0.6748,2.992,2.4043,3,4,
10,67,49,160,5,0.6633,2.9367,2.333,3,4,
11,85,53,160,5,0.6821,2.981,2.3822,3,4,

Upvotes: 2

Views: 5412

Answers (2)

eliasah

Reputation: 40370

OK, I have written a solution that is independent of your dataset. Required imports and example data:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, stddev, col}

val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
df.show(5)

// +---+---+
// | X1| X2|
// +---+---+
// |  1|  0|
// |  2|  0|
// |  3|  0|
// |  4|  0|
// |  5|  0|

First compute standard deviation by column:

// Renaming isn't required, but it makes the result
// easier to read when you show it.
val aggs = df.columns.map(c => stddev(c).as(c))
val stddevs = df.select(aggs: _*)

stddevs.show // stddevs contains the stddev of each column
// +-----------------+---+
// |               X1| X2|
// +-----------------+---+
// |288.5307609250702|0.0|
// +-----------------+---+

Collect the first row and filter columns to keep:

val columnsToKeep: Seq[Column] = stddevs.first  // Take first row
  .toSeq  // convert to Seq[Any]
  .zip(df.columns)  // zip with column names
  .collect {
    // keep only names where stddev != 0
    case (s: Double, c) if s != 0.0  => col(c) 
  }

Select and check the results:

df.select(columnsToKeep: _*).show

// +---+
// | X1|
// +---+
// |  1|
// |  2|
// |  3|
// |  4|
// |  5|
// ...

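For convenience, the steps above can be wrapped into a small helper. This is a sketch under the same imports; `dropConstantColumns` is a hypothetical name, not part of the Spark API:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, stddev}

// Hypothetical helper: drops every column whose stddev is 0.0
// (i.e. constant numeric columns). Not a Spark built-in.
def dropConstantColumns(df: DataFrame): DataFrame = {
  val aggs = df.columns.map(c => stddev(col(c)).as(c))
  val stats = df.select(aggs: _*).first()           // one row of stddevs
  val toKeep: Seq[Column] = df.columns.zip(stats.toSeq).collect {
    case (c, s: Double) if s != 0.0 => col(c)       // keep non-constant columns
  }
  df.select(toKeep: _*)
}
```

With the example DataFrame above, `dropConstantColumns(df).show` would keep only `X1`.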
Upvotes: 4

akuiper

Reputation: 214957

You can try this: use `describe`, `getValuesMap` and `filter` to get the names of the columns you want to drop, then drop them:

// Extract the standard deviation row from the data frame summary
// (assumes `import spark.implicits._` is in scope for the $ syntax):
val stddev = df.describe().filter($"summary" === "stddev").drop("summary").first()

// Use getValuesMap and filter to get the names of the columns where stddev is 0:
val to_drop = stddev.getValuesMap[String](df.columns).filter { case (k, v) => v.toDouble == 0 }.keys

// Drop the zero-stddev columns:
df.drop(to_drop.toSeq: _*).show
+---------+-----+---+------+------+---------+--------+
|RowNumber|Poids|Age|Taille| Hmean|CoocParam|LdpParam|
+---------+-----+---+------+------+---------+--------+
|        0|   87| 72|   160|0.6993|   2.9421|  2.3745|
|        1|   54| 70|   163|0.6301|   2.7273|  2.2205|
|        2|   72| 51|   164|0.6551|   2.9834|  2.3993|
|        3|   75| 74|   170|0.6966|   2.9654|  2.3699|
|        4|  108| 62|   165|0.6087|   2.7093|  2.1619|
|        5|   84| 61|   159|0.6876|    2.938|  2.3601|
|        6|   89| 64|   168|0.6757|   2.9547|  2.3676|
|        7|   75| 72|   160|0.7432|   2.9331|  2.3339|
|        8|   64| 62|   153|0.6505|   2.7676|  2.2255|
|        9|   82| 58|   159|0.6748|    2.992|  2.4043|
|       10|   67| 49|   160|0.6633|   2.9367|   2.333|
|       11|   85| 53|   160|0.6821|    2.981|  2.3822|
+---------+-----+---+------+------+---------+--------+
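One caveat (an assumption about the input, not something the answer states): if the DataFrame contains non-numeric columns, `describe()` reports `null` for their stddev, and `v.toDouble` would then throw. A guarded variant of the filter might look like:

```scala
// Sketch: skip null entries (non-numeric columns) before parsing the stddev.
val to_drop = stddev.getValuesMap[String](df.columns)
  .filter { case (_, v) => v != null && v.toDouble == 0 }
  .keys
```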

Upvotes: 4
