Reputation: 163
df1:
a = c(2, 3, 5, 8, 10, 12)
b = c("NA", "bb", "cc", "aa", "bb", "aa")
c = c("bb", "aa", "bb", "cc", "aa", "aa")
d = c("aa", "cc", "bb", "aa", "aa", "aa")
e = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
df1 = data.frame(a, b, c, d, e)
I want to compute the proportion of each value across columns b, c, and d combined, and then replace any category whose proportion is below 20% with "rare".
Expected output:
a   b     c     d     e
2   NA    bb    aa    TRUE
3   bb    aa    rare  FALSE
5   rare  bb    bb    TRUE
8   aa    rare  aa    FALSE
10  bb    aa    aa    TRUE
12  aa    aa    aa    FALSE
Upvotes: 1
Views: 37
Reputation: 145985
Here's a base R approach:
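# Convert the string "NA" to an actual missing value so it is not counted as a category
df1[df1 == "NA"] = NA
cols = c("b", "c", "d")
# Proportion of each value across b, c, and d combined (table() drops NA by default)
freq = prop.table(table(unlist(df1[cols])))
# Categories whose combined proportion is below 20%
make_rare = names(freq)[freq < 0.2]
# Replace those categories with "rare" in each of the selected columns
df1[cols] = lapply(df1[cols], function(x) replace(x, x %in% make_rare, "rare"))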
df1
#    a    b    c    d     e
# 1  2 <NA>   bb   aa  TRUE
# 2  3   bb   aa rare FALSE
# 3  5 rare   bb   bb  TRUE
# 4  8   aa rare   aa FALSE
# 5 10   bb   aa   aa  TRUE
# 6 12   aa   aa   aa FALSE
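If the same recoding is needed for other columns or thresholds, the steps above could be wrapped in a small function. This is only a sketch: `lump_rare_pooled` and its arguments are hypothetical names, not part of base R, and it assumes missing values are already real `NA` rather than the string "NA". The `as.character()` calls are a guard for R versions before 4.0, where `data.frame()` turns strings into factors by default.

```r
# Hypothetical helper (illustrative name): pool the chosen columns, compute each
# category's share of the pooled values, and relabel those below `threshold`.
lump_rare_pooled <- function(df, cols, threshold = 0.2, label = "rare") {
  # Pool values from all selected columns; as.character() guards against factors
  pooled <- unlist(lapply(df[cols], as.character), use.names = FALSE)
  # table() drops NA by default, so missing values do not count toward the total
  freq <- prop.table(table(pooled))
  rare_levels <- names(freq)[freq < threshold]
  df[cols] <- lapply(df[cols], function(x) {
    x <- as.character(x)
    # NA never matches rare_levels, so missing values are left untouched
    replace(x, x %in% rare_levels, label)
  })
  df
}

# Same data as in the question, with the missing value entered as a real NA
df1 <- data.frame(
  a = c(2, 3, 5, 8, 10, 12),
  b = c(NA, "bb", "cc", "aa", "bb", "aa"),
  c = c("bb", "aa", "bb", "cc", "aa", "aa"),
  d = c("aa", "cc", "bb", "aa", "aa", "aa"),
  e = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

result <- lump_rare_pooled(df1, c("b", "c", "d"))
```

With this data, "cc" makes up 3 of the 17 non-missing pooled values (about 18%), so it is the only category relabeled at the default 20% threshold.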
Upvotes: 1