Reputation: 320
I am trying to write a function that will look at all the factor variables in a data frame and combine all the levels together but only if a level contains less observations then a certain percentage/threshold.
So far I have a function that works on a single variable but I am trying to apply that function to all the factors in the data frame. When I try to apply it to all the factors variables, I get an error
# Code to create data frame (df)
var <- factor(c(a <- rep("a", 100), b <- rep("b", 1000), c <- rep("c", 1000), d <- rep("d", 1000), e <- rep("e", 400), f <- rep("f", 100)))
var1 <- factor(c(a1 <- rep("a", 100), b1 <- rep("b", 400), c1 <- rep("c", 1000), d1 <- rep("d", 1000), e1 <- rep("e", 1000), f <- rep("f", 100)))
x_df <- data.frame(var = var, var1 = var1)
str(x_df)
# check the count of each level
sapply(x_df, function(x){
table(x)
})
# create the function
Merge.factors <- function(x, p) {
#Combines factor levels in x that are less than a specified proportion, p.
t <- table(x)
less <- subset(t, prop.table(t) < p)
more <- subset(t, prop.table(t) >= p)
other <- rep("Other", sum(less))
new.table <- c(more, table(other))
new.x <- as.factor(rep(names(new.table), new.table))
return(new.x)
}
# applying the function to a single factor variable - It works!
# This is the expected result
Merge.factors(x_df$var, 0.15)
Now I have tried two ways to apply this function to all the factor variables
# First method:
sapply(x_df, Merge.factors(0.15)) # Give an error, argument P is missing
# 2nd Method:
for (i in 1:ncol(x_df)) {
x_df[,i] <- Merge.factors(i, 0.15)
}
Any help will be greatly appreciated
Upvotes: 1
Views: 454
Reputation: 388982
In your current function you need to pass the threshold as a different argument
x_df[] <- lapply(x_df, Merge.factors, 0.15)
#Or to be more specific
#x_df[] <- lapply(x_df, function(x) Merge.factors(x, 0.15))
Now check
lapply(x_df, table)
#$var
# b c d Other
# 1000 1000 1000 600
#$var1
# c d e Other
# 1000 1000 1000 600
To exclude certain factors we can change the function to
Merge.factors <- function(x, p) {
t <- table(x)
less <- subset(t, prop.table(t) < p & names(t) != 'e')
more <- subset(t, prop.table(t) >= p | names(t) == "e")
other <- rep("Other", sum(less))
new.table <- c(more, table(other))
new.x <- as.factor(rep(names(new.table), new.table))
return(new.x)
}
x_df[] <- lapply(x_df, Merge.factors, 0.15)
lapply(x_df, table)
#$var
# b c d e Other
# 1000 1000 1000 400 200
#$var1
# c d e Other
# 1000 1000 1000 600
Upvotes: 1
Reputation: 320
I changed the function slightly and hard coded the threshold. Below is the new function, followed by the application of the function to all columns:
# create the function
Merge_factors <- function(x) {
t <- table(x)
less <- subset(t, prop.table(t) < 0.15)
more <- subset(t, prop.table(t) >= 0.15)
other <- rep("Other", sum(less))
new.table <- c(more, table(other))
new.x <- as.factor(rep(names(new.table), new.table))
return(new.x)
}
xs_df <- as.data.frame(sapply(x_df, Merge_factors))
Upvotes: 0