user7394882

Reputation: 193

Drop all columns that satisfy a condition on the column, in Spark

I have a dataset and I need to drop the columns whose standard deviation is equal to 0. I've tried:

val df = spark.read.option("header",true)
  .option("inferSchema", "false").csv("C:/gg.csv")

val aggs = df.columns.map(stddev(_))
val finalresult = df.agg(aggs.head, aggs.tail: _*)

I want to compute the standard deviation of each column and drop the column if it is equal to zero.

RowNumber,Poids,Age,Taille,0MI,Hmean,CoocParam,LdpParam,Test2,Classe,
0,87,72,160,5,0.6993,2.9421,2.3745,3,4,
1,54,70,163,5,0.6301,2.7273,2.2205,3,4,
2,72,51,164,5,0.6551,2.9834,2.3993,3,4,
3,75,74,170,5,0.6966,2.9654,2.3699,3,4,
4,108,62,165,5,0.6087,2.7093,2.1619,3,4,
5,84,61,159,5,0.6876,2.938,2.3601,3,4,
6,89,64,168,5,0.6757,2.9547,2.3676,3,4,
7,75,72,160,5,0.7432,2.9331,2.3339,3,4,
8,64,62,153,5,0.6505,2.7676,2.2255,3,4,
9,82,58,159,5,0.6748,2.992,2.4043,3,4,
10,67,49,160,5,0.6633,2.9367,2.333,3,4,
11,85,53,160,5,0.6821,2.981,2.3822,3,4,

Upvotes: 2

Views: 5412

Answers (2)

eliasah

Reputation: 40370

OK, I have written a solution that is independent of your dataset. Required imports and example data:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, stddev, col}

val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
df.show(5)

// +---+---+
// | X1| X2|
// +---+---+
// |  1|  0|
// |  2|  0|
// |  3|  0|
// |  4|  0|
// |  5|  0|

First compute standard deviation by column:

// Renaming isn't required, but it makes the result
// easier to read when you show it.
val aggs = df.columns.map(c => stddev(c).as(c))
val stddevs = df.select(aggs: _*)

stddevs.show // stddevs contains the stddev of each column
// +-----------------+---+
// |               X1| X2|
// +-----------------+---+
// |288.5307609250702|0.0|
// +-----------------+---+

Collect the first row and filter columns to keep:

val columnsToKeep: Seq[Column] = stddevs.first  // Take first row
  .toSeq  // convert to Seq[Any]
  .zip(df.columns)  // zip with column names
  .collect {
    // keep only names where stddev != 0
    case (s: Double, c) if s != 0.0  => col(c) 
  }

Select and check the results:

df.select(columnsToKeep: _*).show

// +---+
// | X1|
// +---+
// |  1|
// |  2|
// |  3|
// |  4|
// |  5|
// ...

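For convenience, the steps above can be wrapped into a small helper. This is a sketch under the same imports; `dropConstantColumns` is a hypothetical name, not part of the Spark API:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, stddev}

// Hypothetical helper: drops every column whose stddev is 0.0
// (i.e. constant numeric columns). Not a Spark built-in.
def dropConstantColumns(df: DataFrame): DataFrame = {
  val aggs = df.columns.map(c => stddev(col(c)).as(c))
  val stats = df.select(aggs: _*).first()           // one row of stddevs
  val toKeep: Seq[Column] = df.columns.zip(stats.toSeq).collect {
    case (c, s: Double) if s != 0.0 => col(c)       // keep non-constant columns
  }
  df.select(toKeep: _*)
}
```

With the example DataFrame above, `dropConstantColumns(df).show` would keep only `X1`.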
Upvotes: 4

akuiper

Reputation: 214957

You can try this: use `describe`, `getValuesMap` and `filter` to get the names of the columns you want to drop, then drop them:

// Extract the standard deviation row from the data frame summary
// (assumes `import spark.implicits._` is in scope for the $ syntax):
val stddev = df.describe().filter($"summary" === "stddev").drop("summary").first()

// Use getValuesMap and filter to get the names of the columns where stddev is 0:
val to_drop = stddev.getValuesMap[String](df.columns).filter { case (k, v) => v.toDouble == 0 }.keys

// Drop the zero-stddev columns:
df.drop(to_drop.toSeq: _*).show
+---------+-----+---+------+------+---------+--------+
|RowNumber|Poids|Age|Taille| Hmean|CoocParam|LdpParam|
+---------+-----+---+------+------+---------+--------+
|        0|   87| 72|   160|0.6993|   2.9421|  2.3745|
|        1|   54| 70|   163|0.6301|   2.7273|  2.2205|
|        2|   72| 51|   164|0.6551|   2.9834|  2.3993|
|        3|   75| 74|   170|0.6966|   2.9654|  2.3699|
|        4|  108| 62|   165|0.6087|   2.7093|  2.1619|
|        5|   84| 61|   159|0.6876|    2.938|  2.3601|
|        6|   89| 64|   168|0.6757|   2.9547|  2.3676|
|        7|   75| 72|   160|0.7432|   2.9331|  2.3339|
|        8|   64| 62|   153|0.6505|   2.7676|  2.2255|
|        9|   82| 58|   159|0.6748|    2.992|  2.4043|
|       10|   67| 49|   160|0.6633|   2.9367|   2.333|
|       11|   85| 53|   160|0.6821|    2.981|  2.3822|
+---------+-----+---+------+------+---------+--------+
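One caveat (an assumption about the input, not something the answer states): if the DataFrame contains non-numeric columns, `describe()` reports `null` for their stddev, and `v.toDouble` would then throw. A guarded variant of the filter might look like:

```scala
// Sketch: skip null entries (non-numeric columns) before parsing the stddev.
val to_drop = stddev.getValuesMap[String](df.columns)
  .filter { case (_, v) => v != null && v.toDouble == 0 }
  .keys
```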

Upvotes: 4
