Reputation: 28169
I have a data frame and I'm trying to take a factor variable and keep only the top 31 levels and make all the other levels some generic level.
I need to do this across several vectors, so I figured I'd create a function, but I'm not having much luck. I think I need to somehow use mapply or Vectorize, but I don't think I'm doing it properly, as I get error messages about being unable to allocate 3.6 GB of memory.
This is the function, where x is the vector and topCount is the number of levels to keep:
createFactor <- function(x, topCount){
  table1 <- data.frame(table(x))
  table1 <- table1[order(-table1$Freq),]
  noChange <- table1$Var1[1:topCount]
  newVals1 <- factor(ifelse(x %in% noChange, x, "-1000"))
  newVals1
}
I'd like to be able to write something like this:
df1$topLevels <- createFactor(df1$fact1, 31)
Any suggestions?
Upvotes: 0
Views: 1414
Reputation: 173717
I'm not completely certain about the performance characteristics of this, but I probably would have written this function more like so:
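topK <- function(x, k){
  # Count each level; nbins makes sure unused trailing levels get a count of 0
  tbl <- tabulate(x, nbins = nlevels(x))
  names(tbl) <- levels(x)
  x <- as.character(x)
  # Names of the k most frequent levels
  levelsToKeep <- names(tail(sort(tbl), k))
  # Collapse every other level into one generic level
  x[!(x %in% levelsToKeep)] <- '-1000'
  factor(x)
}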
where I've used tabulate rather than table because I suspect it may be faster (which seems important in your case), although I haven't tested this to see how much faster it would actually be.
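To do this across several columns, as the question asks, mapply or Vectorize probably aren't needed; calling the function once per column with lapply should be enough. Below is a minimal sketch of that idea. The sample df1, the fact2 column, the `other` argument and the `_top` suffix are made up for illustration and are not from your data. Note that, as written, NA values also end up in the generic level.

```r
# topK restated so this block runs on its own.
# Assumption: x is a factor. 'other' is a hypothetical argument for the generic label.
topK <- function(x, k, other = "-1000") {
  stopifnot(is.factor(x))
  counts <- tabulate(x, nbins = nlevels(x))  # one count per level, including unused ones
  names(counts) <- levels(x)
  keep <- names(tail(sort(counts), k))       # names of the k most frequent levels
  x <- as.character(x)
  x[!(x %in% keep)] <- other                 # NA values are relabelled here too
  factor(x)
}

# Hypothetical data standing in for df1
set.seed(1)
df1 <- data.frame(
  fact1 = factor(sample(letters, 1000, replace = TRUE, prob = 26:1)),
  fact2 = factor(sample(LETTERS[1:10], 1000, replace = TRUE))
)

# A single column, in the form the question asks for
df1$topLevels <- topK(df1$fact1, 5)

# Several columns: lapply calls topK once per column
cols <- c("fact1", "fact2")
df1[paste0(cols, "_top")] <- lapply(df1[cols], topK, k = 3)
```

Each call only touches one column at a time, so memory use should stay close to the size of that column plus its character copy.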
Upvotes: 3