Reputation: 461
In a data frame like this:
Temp=c(26:30)
Feels=c("cold","cold","cold","hot","hot")
Time=c("night","night","night","night","day")
Year=as.character(c(2011,2015,2010,2015,2015))
DF=data.frame(Temp,Feels,Time,Year)
I would like to remove all categorical or numeric columns with little variation, e.g. when the proportion of the most frequent value is >80%, like in "Time" and "Year". Any ideas with dplyr or other methods?
Upvotes: 1
Views: 28
Reputation: 3414
You can use dplyr's select_if and remove weakly variable columns:
Feels <- c("cold", "cold", "cold", "hot", "hot")
Time <- c("night", "night", "night", "night", "day")
Year <- as.character(c(2011, 2015, 2010, 2015, 2015))
df <- data.frame(Temp = rnorm(length(Feels)), Feels, Time, Year)
library(tidyverse)
threshold <- .8
# TRUE when the most frequent value accounts for less than `threshold` of the column
is_variable <- function(x) {
  max(prop.table(table(x))) < threshold
}
df %>% select_if(is_variable)
Output:
> df %>% select_if(is_variable)
        Temp Feels Year
1 -0.6747300  cold 2011
2 -0.4504562  cold 2015
3 -1.4048543  cold 2010
4 -1.7972187   hot 2015
5 -0.6955777   hot 2015
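To see why each column is kept or dropped, it can help to look at the share of the most frequent value per column before filtering. The sketch below uses the question's data with Temp = 26:30; the helper name top_share is made up for illustration. It also shows two alternatives to select_if, which is superseded in dplyr 1.0.0 and later: select(where(...)) and plain base R subsetting. It assumes no column is entirely NA, since table() ignores NAs and an empty table would make max() return -Inf with a warning.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  Temp  = 26:30,
  Feels = c("cold", "cold", "cold", "hot", "hot"),
  Time  = c("night", "night", "night", "night", "day"),
  Year  = as.character(c(2011, 2015, 2010, 2015, 2015))
)

threshold <- 0.8

# Hypothetical helper: proportion of the most frequent value in a vector
# (table() ignores NAs, so the share is computed over non-missing values)
top_share <- function(x) max(prop.table(table(x)))

# Inspect the shares per column before dropping anything
shares <- sapply(df, top_share)
print(shares)

# dplyr >= 1.0.0: select(where(...)) in place of select_if()
kept <- df %>% select(where(function(x) top_share(x) < threshold))

# Base R equivalent, no packages needed
kept_base <- df[, shares < threshold, drop = FALSE]

print(names(kept))
```

With this data only Time reaches the threshold (4 of 5 values are "night"), so Temp, Feels and Year remain. Year's most frequent value covers 3 of 5 rows, so it is kept unless the threshold is lowered to 0.6 or below.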
Upvotes: 1