Thomas Speidel

Relevel factors based on values of another variable

I'd like to relevel a factor variable based on the value of another variable. For instance:

factors <- structure(list(color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"),
  count = c(2, 5, 11, 1, 19)), row.names = c(NA, -5L), class = c("tbl_df",
  "tbl", "data.frame"))

> factors
# A tibble: 5 x 2
  color  count
  <chr>  <dbl>
1 RED        2
2 GREEN      5
3 BLUE      11
4 YELLOW     1
5 BROWN     19


Here's what I want to produce:

##Group all levels with count < 10 into "OTHER"

> factors.out
# A tibble: 3 x 2
  color count
  <chr> <dbl>
1 OTHER     8
2 BLUE     11
3 BROWN    19


I thought this was a job for forcats::fct_lump(), but it leaves the levels unchanged:

##Keep 3 levels
factors %>%
  mutate(color = fct_lump(color, n = 3))
# A tibble: 5 x 2
  color  count
  <fct>  <dbl>
1 RED        2
2 GREEN      5
3 BLUE      11
4 YELLOW     1
5 BROWN     19


I know one can do that with something like:

factors %>%
  mutate(color = ifelse(count < 10, "OTHER", color)) %>%
  group_by(color) %>%
  summarise(count = sum(count))


But I was hoping there was a convenience function in forcats for this.


Answers (1)

user666993

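Because you already have a data.frame containing the factor levels and their counts, you can pass the counts as weights (the w argument) when lumping the rarest levels together. Without weights, every level appears exactly once, so fct_lump() has nothing to lump. The second stage just collapses the OTHER category, as in your example. Note that n = 2 keeps the two levels with the largest weighted counts; it matches your count < 10 rule here only because exactly two levels have a count of 10 or more.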

factors <- structure(list(color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"),
  count = c(2, 5, 11, 1, 19)), row.names = c(NA, -5L), class = c("tbl_df", 
  "tbl", "data.frame"))

library("dplyr")
library("forcats")

factors.out <- factors %>%
  mutate(color = fct_lump(color, n = 2, other_level = "OTHER",
    w = count)) %>%
  group_by(color) %>%
  summarise(count = sum(count)) %>%
  arrange(count)

giving

factors.out 
# A tibble: 3 x 2
  color count
  <fct> <dbl>
1 OTHER     8
2 BLUE     11
3 BROWN    19
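
If you would rather state the cutoff itself than a number of levels to keep, here is a minimal sketch, assuming your installed forcats is recent enough to provide fct_lump_min() (it is not in older releases). The name factors.min is just a placeholder for this example; the data is the same as above, rebuilt with tibble() so the snippet runs on its own.

```r
library("dplyr")
library("forcats")

# Same sample data as in the question.
factors <- tibble(
  color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"),
  count = c(2, 5, 11, 1, 19)
)

# Lump every level whose weighted count is below 10 into "OTHER",
# then collapse the repeated OTHER rows by summing their counts.
factors.min <- factors %>%
  mutate(color = fct_lump_min(color, min = 10, w = count,
    other_level = "OTHER")) %>%
  group_by(color) %>%
  summarise(count = sum(count)) %>%
  arrange(count)
```

On the sample data this should give the same three rows as factors.out, but it keeps working if a third level later reaches a count of 10 or more, whereas n = 2 would push that level into OTHER.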
