SAJ
SAJ

Reputation: 320

Combine factor variable levels with less observations together. for all factor variables in a data frame

I am trying to write a function that will look at all the factor variables in a data frame and combine all the levels together but only if a level contains less observations then a certain percentage/threshold.

So far I have a function that works on a single variable but I am trying to apply that function to all the factors in the data frame. When I try to apply it to all the factors variables, I get an error

# Code to create data frame (df)
var <- factor(c(a <- rep("a", 100), b <- rep("b", 1000), c <- rep("c", 1000), d <- rep("d", 1000), e <- rep("e", 400), f <- rep("f", 100)))
var1 <- factor(c(a1 <- rep("a", 100), b1 <- rep("b", 400), c1 <- rep("c", 1000), d1 <- rep("d", 1000), e1 <- rep("e", 1000), f <- rep("f", 100)))
x_df <- data.frame(var = var, var1 = var1)
str(x_df)


# check the count of each level 
sapply(x_df, function(x){
  table(x)
})

# create the function 
Merge.factors <- function(x, p) { 
  #Combines factor levels in x that are less than a specified proportion, p.
  t <- table(x)                 
  less <- subset(t, prop.table(t) < p)
  more <- subset(t, prop.table(t) >= p)
  other <- rep("Other", sum(less))
  new.table <- c(more, table(other))
  new.x <- as.factor(rep(names(new.table), new.table))
  return(new.x)
}

# applying the function to a single factor variable -  It works!
# This is the expected result 
Merge.factors(x_df$var, 0.15) 

Now I have tried two ways to apply this function to all the factor variables

# First method: 
sapply(x_df, Merge.factors(0.15))   # Give an error, argument P is missing

# 2nd Method:
for (i in 1:ncol(x_df)) {
  x_df[,i] <- Merge.factors(i, 0.15)
}

Any help will be greatly appreciated

Upvotes: 1

Views: 454

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388982

In your current function you need to pass the threshold as a different argument

x_df[] <- lapply(x_df, Merge.factors, 0.15)
#Or to be more specific
#x_df[] <- lapply(x_df, function(x) Merge.factors(x, 0.15))

Now check

lapply(x_df, table)

#$var
#    b     c     d Other 
# 1000  1000  1000   600 

#$var1
#    c     d     e Other 
# 1000  1000  1000   600 

To exclude certain factors we can change the function to

Merge.factors <- function(x, p) { 
  t <- table(x)                 
  less <- subset(t, prop.table(t) < p & names(t) != 'e')
  more <- subset(t, prop.table(t) >= p | names(t) == "e")
  other <- rep("Other", sum(less))
  new.table <- c(more, table(other))
  new.x <- as.factor(rep(names(new.table), new.table))
  return(new.x)
}

x_df[] <- lapply(x_df, Merge.factors, 0.15)
lapply(x_df, table)

#$var
#    b     c     d     e Other 
# 1000  1000  1000   400   200 

#$var1
#    c     d     e Other 
# 1000  1000  1000   600 

Upvotes: 1

SAJ
SAJ

Reputation: 320

I changed the function slightly and hard coded the threshold. Below is the new function, followed by the application of the function to all columns:

# create the function 
Merge_factors <- function(x) { 
  t <- table(x)                 
  less <- subset(t, prop.table(t) < 0.15)
  more <- subset(t, prop.table(t) >= 0.15)
  other <- rep("Other", sum(less))
  new.table <- c(more, table(other))
  new.x <- as.factor(rep(names(new.table), new.table))
  return(new.x)
}

xs_df <- as.data.frame(sapply(x_df, Merge_factors))

Upvotes: 0

Related Questions