Reputation: 163
df
a = c("aa", "bb", "cc", "bb", "bb", "cc","bb", "bb", "cc", "cc", "bb", "cc", "bb", "bb", "cc","bb", "bb", "cc", "cc", "bb","bb")
b = c("aa", "bb", "cc", "bb", "bb", "cc","bb", "bb", "cc", "cc", "bb", "cc", "bb", "bb", "cc","bb", "bb", "cc", "cc", "bb","bb")
c = c("aa", "aa", "aa", "bb", "bb", "cc","bb", "bb", "cc", "cc", "bb", "cc", "bb", "bb", "cc","bb", "bb", "cc", "cc", "bb","bb")
d = c(1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 1)
df = data.frame(a,b,c,d)
Col Names:
cols <- c("a","b","c")
Function:
rare_label <- function(x){
freq = prop.table(table(unlist(x)))
make_rare = names(freq)[freq < 0.20]
lapply(x,
function(x) {
replace(x, x %in% make_rare, "Rare")
})}
Looking to evaluate with dplyr::mutate(across())
the proportion of all values combined in a, b, c and then change any category with a proportion below 20% to "Rare".
Output:
a b c
Rare Rare Rare
bb bb Rare
cc cc Rare
bb bb bb
bb bb bb
cc cc cc
bb bb bb
. . .
. . .
. . .
Using the code below throws an error and I'm not sure why.
df %<>%
mutate(across(where(cols), ~rare_label(.)
Error: unexpected symbol in: " mutate(across(where(cols), ~rare_label(.) View"
Upvotes: 2
Views: 3823
Reputation: 388797
Your function is correct but you need to do 2 changes.
lapply
and keep the last line as :replace(x, x %in% make_rare, "Rare")
where
from across
since you are calling column by their names.After doing those changes your code should work.
Another option is to use forcats
package which has functions to do this kind of thing.
library(dplyr)
library(forcats)
df %>%
mutate(across(all_of(cols),fct_lump_min, min = n() * .2, other_level = "rare"))
# a b c d
#1 rare rare rare 1
#2 bb bb rare 1
#3 cc cc rare 2
#4 bb bb bb 2
#5 bb bb bb 3
#6 cc cc cc 3
#7 bb bb bb 1
#8 bb bb bb 1
#9 cc cc cc 1
#10 cc cc cc 1
#11 bb bb bb 1
#12 cc cc cc 1
#13 bb bb bb 2
#...
fct_lump_min
changes all the factors to "rare" which occur less than 20% (0.2 * n()). Here we are passing a number for n
to drop levels, I couldn't find a function which works by passing proportion itself, fct_lump_prop
does something else.
Upvotes: 2
Reputation: 39858
One option could be:
df %>%
mutate(across(all_of(cols),
~ replace(., . %in% names(which(prop.table(table(.)) < 0.20)), "rare")))
a b c d
1 rare rare rare 1
2 bb bb rare 1
3 cc cc rare 2
4 bb bb bb 2
5 bb bb bb 3
6 cc cc cc 3
7 bb bb bb 1
8 bb bb bb 1
9 cc cc cc 1
10 cc cc cc 1
If you want to apply an existing function:
fun <- function(x) replace(x, x %in% names(which(prop.table(table(x)) < 0.20)), "rare")
df %>%
mutate(across(all_of(cols), fun))
Upvotes: 4
Reputation: 39585
Your code works well, just change pipe and values like this:
#Code
df %>%
mutate(across(c(a:c), ~rare_label(.))
Output:
a b c d
1 Rare Rare Rare 1
2 bb bb Rare 1
3 cc cc Rare 2
4 bb bb bb 2
5 bb bb bb 3
6 cc cc cc 3
7 bb bb bb 1
8 bb bb bb 1
9 cc cc cc 1
10 cc cc cc 1
11 bb bb bb 1
12 cc cc cc 1
13 bb bb bb 2
14 bb bb bb 2
15 cc cc cc 3
16 bb bb bb 3
17 bb bb bb 1
18 cc cc cc 1
19 cc cc cc 1
20 bb bb bb 1
21 bb bb bb 1
Upvotes: 5