Remove char columns with little variation

Question

Given a data frame like this:

Temp <- 26:30
Feels <- c("cold", "cold", "cold", "hot", "hot")
Time <- c("night", "night", "night", "night", "day")
Year <- as.character(c(2011, 2015, 2010, 2015, 2015))
DF <- data.frame(Temp, Feels, Time, Year)

I would like to remove all columns, categorical or numeric, that show little variation, e.g. when the proportion of the most frequent value is 80% or more, as in "Time" (4 of 5 values are "night"). Any ideas with dplyr or other methods?
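To make the criterion concrete, here is a minimal base R sketch that computes, for each column of the example data, the share taken by its most frequent value. The name `top_share` is only an illustrative choice and not part of any package.

```r
# Example data from the question
DF <- data.frame(
  Temp  = 26:30,
  Feels = c("cold", "cold", "cold", "hot", "hot"),
  Time  = c("night", "night", "night", "night", "day"),
  Year  = as.character(c(2011, 2015, 2010, 2015, 2015))
)

# Share of the most frequent value in each column
# (table() ignores NA values by default)
top_share <- sapply(DF, function(x) max(prop.table(table(x))))
print(top_share)
#  Temp Feels  Time  Year
#   0.2   0.6   0.8   0.6
```

With these numbers, only "Time" reaches an 80% share.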

Artem · Accepted Answer

You can use dplyr's select_if with a predicate function, so that only the columns whose most frequent value stays below the threshold are kept:

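library(dplyr)

# Example data; Temp is random here, so its values change from run to run
Feels <- c("cold", "cold", "cold", "hot", "hot")
Time <- c("night", "night", "night", "night", "day")
Year <- as.character(c(2011, 2015, 2010, 2015, 2015))
df <- data.frame(Temp = rnorm(length(Feels)), Feels, Time, Year)

# Drop a column when its most frequent value reaches this share of the values
threshold <- 0.8

# TRUE when the most frequent value covers less than `threshold` of the column
is_variable <- function(x) {
  max(prop.table(table(x))) < threshold
}

# Keep only the columns for which is_variable() returns TRUE
df %>% select_if(is_variable)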

Output:

> df %>% select_if(is_variable)
        Temp Feels Year
1 -0.6747300  cold 2011
2 -0.4504562  cold 2015
3 -1.4048543  cold 2010
4 -1.7972187   hot 2015
5 -0.6955777   hot 2015
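
As a side note, select_if is marked as superseded in dplyr 1.0.0 and later, in favour of select(where(...)). The sketch below shows the same filter written that way, plus a base R equivalent with Filter. It assumes dplyr >= 1.0.0, uses a fixed Temp column instead of rnorm so the result is reproducible, and passes the threshold as a function argument, which is a choice made for this illustration only.

```r
library(dplyr)

# Same example data as in the question, with a fixed Temp column
df <- data.frame(
  Temp  = 26:30,
  Feels = c("cold", "cold", "cold", "hot", "hot"),
  Time  = c("night", "night", "night", "night", "day"),
  Year  = as.character(c(2011, 2015, 2010, 2015, 2015))
)

# TRUE when the most frequent value covers less than `threshold` of the column
is_variable <- function(x, threshold = 0.8) {
  max(prop.table(table(x))) < threshold
}

# dplyr >= 1.0.0: where() selects the columns for which the predicate is TRUE
kept_dplyr <- df %>% select(where(is_variable))

# Base R: Filter() keeps the list elements (here, columns) that pass the predicate
kept_base <- Filter(is_variable, df)

names(kept_dplyr)
# "Temp"  "Feels" "Year"
```

Both versions drop "Time", whose most frequent value ("night") accounts for exactly 80% of the rows.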
