Reputation: 175
Trying to collapse a nominal categorical vector by combining low-frequency counts into an 'Other' category.
The data (a column of a data frame) looks like this and contains entries for all 50 states:
California
Florida
Alabama
...
table(colname)/length(colname)
correctly returns the frequencies; what I'm trying to do is lump everything below a given threshold (say f = 0.02) together. What is the correct approach?
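For reference, a reproducible sketch of such a column (made-up skewed frequencies, using R's built-in state.name vector):
set.seed(42)
colname <- sample(state.name, 500, replace = TRUE,
                  prob = rep(c(20, 1), c(5, 45)))  # 5 common states, 45 rare ones
head(round(table(colname) / length(colname), 3))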
Upvotes: 6
Views: 4679
Reputation: 5167
A little late to the game, but you may use my package DataExplorer. The group_category
function is exactly what you are looking for. There are other options too; type ?group_category
to find out more.
For example, in your case:
library(DataExplorer)
group_category(data, "colname", 0.02, update = TRUE)
Here are more examples.
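As a self-contained sketch (the data frame and column name here are made up, and defaults such as the replacement category's label may vary by package version, so check ?group_category):
library(DataExplorer)

## Hypothetical data: one column of state names with skewed frequencies
data <- data.frame(colname = sample(state.name, 500, replace = TRUE,
                                    prob = rep(c(20, 1), c(5, 45))))

## Lump every state below 2% frequency; update = TRUE applies the change
## to the returned data rather than just reporting the grouping
updated <- group_category(data, "colname", 0.02, update = TRUE)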
Upvotes: 4
Reputation: 42572
Hadley Wickham's forcats
package (available on CRAN since 2016-08-29) has a handy function fct_lump()
which lumps together levels of a factor according to different criteria.
OP's requirement to lump together factor levels below a frequency threshold of 0.02 can be achieved by
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
forcats::fct_lump(a, prop = 0.02)
 [1] c     d     d     e     j     h     c     h     g     i     g     d
[13] f     Other g     h     h     a     b     h     e     g     h     b
[25] d     e     e     g     i     f     d     e     g     c     g     a
[37] Other i     i     b     i     j     f     d     c     h     Other j
[49] j     c     Other e     f     a     a     h     e     c     Other b
Levels: a b c d e f g h i j Other
Note that the sample data from Ananda Mahto's answer has been used for comparison.
The function offers even more possibilities, e.g., it can keep the 5 factor levels with the lowest frequencies and lump all other levels together:
forcats::fct_lump(a, n = -5)
 [1] Other Other Other Other Other Other Other Other Other Other Other Other
[13] Other D     Other Other Other Other Other Other Other Other Other Other
[25] Other Other Other Other Other Other Other Other Other Other Other Other
[37] B     Other Other Other Other Other Other Other Other Other E     Other
[49] Other Other C     Other Other Other Other Other Other Other A     Other
Levels: A B C D E Other
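In recent forcats releases, fct_lump() has been superseded by more specific variants, so the same two calls can also be written as follows (a sketch; requires forcats 0.5.0 or later):
forcats::fct_lump_prop(a, prop = 0.02)  # lump levels below 2% frequency
forcats::fct_lump_n(a, n = -5)          # negative n keeps the least frequent levels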
Upvotes: 6
Reputation: 21
I used an updated version of the condenseMe function:
condenseMe <- function(vector, name, limit) {
  ## Levels whose relative frequency falls below the limit
  toCondense <- names(which(prop.table(table(vector)) < limit))
  ## Merge those levels into `name` by renaming the levels attribute
  levels(vector)[levels(vector) %in% toCondense] <- name
  vector
}
Note: if NA is included as one of the factor levels, in some cases this version of condenseMe will turn the NA level into actual missing values. That is what happened in my case.
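A sketch of a guard against that pitfall: rename the NA level to an explicit string before collapsing (the label "Missing" is just an example):
v <- addNA(factor(c("A", "A", "B", NA)))  # factor with NA as an explicit level
levels(v)                                 # "A" "B" NA
levels(v)[is.na(levels(v))] <- "Missing"  # make the NA level a real string
levels(v)                                 # "A" "B" "Missing"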
Upvotes: 2
Reputation: 193667
From the sounds of it, something like the following should work for you:
condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}
Try it out:
## Sample data
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
round(prop.table(table(a)), 2)
# a
# a A b B c C d D e E f g h
# 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13
# i j
# 0.08 0.07
a
# [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
# [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"
condenseMe(a)
# [1] "c" "d" "d" "e" "j" "h" "c" "h"
# [9] "g" "i" "g" "d" "f" "Other" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b"
# [25] "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "Other" "i" "i" "b"
# [41] "i" "j" "f" "d" "c" "h" "Other" "j"
# [49] "j" "c" "Other" "e" "f" "a" "a" "h"
# [57] "e" "c" "Other" "b"
Note, however, that if you are dealing with factors, you should convert them with as.character first.
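For example, a quick sketch with the sample vector above coerced to a factor:
f <- factor(a)               # suppose the column arrived as a factor
condenseMe(as.character(f))  # convert, then condense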
Upvotes: 11
Reputation: 17279
I'm including an option that makes the change to the factor levels attribute, as an alternative to Ananda Mahto's answer. The efficiency (based on a factor of 10,000 observations with 10 levels) isn't much different (about 150 microseconds), so your choice would likely depend on whether you want to work with factors or with strings. Personally, I'd choose to work with strings and would use Ananda's approach.
#* Using a list assignment to levels
combineByThreshold <- function(x, threshold = .02, label = "Other") {
  prop <- prop.table(table(x))
  combine <- which(prop < threshold)
  ## Map each retained level to itself and all low-frequency levels to `label`
  levels(x) <- c(mapply(identity, levels(x)[-combine],
                        SIMPLIFY = FALSE, USE.NAMES = TRUE),
                 setNames(list(levels(x)[combine]), label))
  x
}
#* Ananda Mahto
condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}
#* F.R.
collapsecategory <- function(x, p) {
  levels_len = length(levels(x))
  levels(x)[levels_len + 1] = 'Other'
  y = table(x) / length(x)
  y1 = as.vector(y)
  y2 = names(y)
  y2_len = length(y2)
  for (i in 1:y2_len) {
    if (y1[i] <= p) {
      x[x == y2[i]] = 'Other'
    }
  }
  x <- droplevels(x)
  x
}
#* Steven Le
filtered_data <- State %>%
  group_by(ID) %>%
  summarise(n = n(),
            freq = n / nrow(State),
            above_thresh = freq > 0.2)
filtered_data$ID[filtered_data$above_thresh == TRUE] <- "above_0.2"
#**************************
#* Comparison of methods
library(microbenchmark)
library(dplyr)

x <- sample(LETTERS[1:10], 10000,
            prob = rep(c(1, 20), c(4, 6)),
            replace = TRUE)
x <- as.factor(x)

DF <- data.frame(x = x) %>%
  mutate(x = as.character(x),
         orig = x)

microbenchmark(
  combineByThreshold = combineByThreshold(x),
  condenseMe = condenseMe(as.character(x)),
  collapsecategory = collapsecategory(x, .02),
  dplyr = {
    Ref <- DF %>%
      group_by(orig, x) %>%
      summarise(n = n(),
                freq = n / nrow(DF),
                above_thresh = freq < 0.02)
    Ref$x[Ref$above_thresh == TRUE] <- "Other"
    DF <- left_join(DF[, "orig", drop = FALSE], Ref[, c("orig", "x")],
                    by = c("orig" = "orig"))
  }
)
Unit: microseconds
               expr      min        lq      mean    median        uq      max neval
 combineByThreshold  694.108  715.0740  872.8452  735.4550  771.5235 3403.971   100
         condenseMe  756.275  771.9635  893.2784  789.5585  827.9730 3753.223   100
   collapsecategory 3679.032 3713.1955 4156.7700 3772.1375 3949.4030 6852.219   100
              dplyr 4101.596 4229.0105 4515.9785 4408.6220 4602.4560 7772.711   100
Brief disclaimer: I may not have represented Steven Le's answer fairly here. I interpreted his answer as generating a reference table of the new factor levels that could be merged back into the original data. Someone please correct me if I've misinterpreted.
Upvotes: 0
Reputation: 748
Using the package dplyr, and assuming your data frame (let's call it State) has one field called ID for each State name...
filtered_data <- State %>%
  group_by(ID) %>%
  summarise(n = n(),
            freq = n / nrow(State),
            above_thresh = freq > 0.2)
filtered_data$ID[filtered_data$above_thresh == TRUE] <- "above_0.2"
Effectively, this gives the label "above_0.2" to the name of any state with a frequency above 0.2.
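To match the OP's actual requirement (lumping low frequencies rather than flagging high ones), the same pattern can be flipped; a sketch, still assuming a State data frame with an ID column holding the state names:
library(dplyr)

## Relabel any state whose frequency falls below 0.02 as "Other"
lumped <- State %>%
  group_by(ID) %>%
  summarise(n = n(),
            freq = n / nrow(State)) %>%
  mutate(ID = ifelse(freq < 0.02, "Other", ID))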
Upvotes: 1
Reputation: 175
Seems to work, but it's quite ugly. Is there a more elegant solution?
collapsecategory <- function(x, p) {
  levels_len = length(levels(x))
  levels(x)[levels_len + 1] = 'Other'
  y = table(x) / length(x)
  y1 = as.vector(y)
  y2 = names(y)
  y2_len = length(y2)
  for (i in 1:y2_len) {
    if (y1[i] <= p) {
      x[x == y2[i]] = 'Other'
    }
  }
  x <- droplevels(x)
  x
}
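For comparison, the loop can be replaced with a single vectorized assignment to the levels attribute, in the spirit of the other answers (a sketch taking the same arguments):
collapsecategory2 <- function(x, p) {
  freq <- prop.table(table(x))
  ## Merge every level at or below frequency p into a single "Other" level
  levels(x)[levels(x) %in% names(freq)[freq <= p]] <- "Other"
  x
}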
Upvotes: 1