Determining most/least amount of occurrences within subset row & column group in a data frame

Question

I am trying to find the most and least frequent values within a group that spans several rows and two columns of a larger data frame. Here is some sample data to make it clearer:

df <- data.frame(matrix(nrow = 8, ncol = 3))
df$X1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
df$X2 <- c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange") 
df$X3 <- c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA) 
names(df) <- c("group", "A", "B")

Here is what that looks like (I have NAs in the original data, so I've included them):

  group      A      B
1     1 yellow  green
2     1  green yellow
3     1 yellow   <NA>
4     2   blue   blue
5     2   <NA>    red
6     3 orange purple
7     3   <NA> orange
8     3 orange   <NA>

In the first "group", for instance, I want to determine which color occurs the most and which color occurs the least. Something that looks like this:

  group      A      B   most  least
1     1 yellow  green yellow  green
2     1  green yellow yellow  green
3     1 yellow   <NA> yellow  green
4     2   blue   blue   blue    red
5     2   <NA>    red   blue    red
6     3 orange purple orange purple
7     3   <NA> orange orange purple
8     3 orange   <NA> orange purple

I am working within a dplyr chain in the original data so I can group_by "group", but I am having a hard time figuring out a method that allows me to work within a "cluster" of two columns with differing numbers of rows. I do not need this to be done with dplyr, but I figured it might be easiest given the usefulness of group_by. Additionally, I need the result to somehow remain in the original data frame as new columns. Any suggestions?

www · Accepted Answer

One solution uses dplyr and tidyr. The strategy is to find the "most" and "least" frequent value for each group in a separate summary data frame, then use right_join to merge that summary back onto the original data frame and produce the desired output.

Notice that I used slice to pick the first and last row of each group after sorting by count. This guarantees exactly one "most" and one "least" per group. However, a group may contain ties, in which case slice simply keeps whichever tied value happens to be sorted first or last. If that can happen in your data, you may want to decide on a rule for which value counts as the "most" or the "least".

library(dplyr)
library(tidyr)

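df2 <- df %>%
  # Stack columns A and B into a single Value column, dropping NAs
  gather(Column, Value, -group, na.rm = TRUE) %>%
  # Count how often each value occurs within each group
  count(group, Value) %>%
  # Put the most frequent value first within each group
  arrange(group, desc(n)) %>%
  group_by(group) %>%
  # Keep only the first (most) and last (least) row per group
  slice(c(1, n())) %>%
  mutate(Type = c("most", "least")) %>%
  select(-n) %>%
  # One row per group, with "most" and "least" as columns
  spread(Type, Value) %>%
  # Attach the summary to every row of the original data
  right_join(df, by = "group") %>%
  select(c(colnames(df), "most", "least"))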
df2
# A tibble: 8 x 5
  group      A      B   most  least
  <dbl>  <chr>  <chr>  <chr>  <chr>
1     1 yellow  green yellow  green
2     1  green yellow yellow  green
3     1 yellow   <NA> yellow  green
4     2   blue   blue   blue    red
5     2   <NA>    red   blue    red
6     3 orange purple orange purple
7     3   <NA> orange orange purple
8     3 orange   <NA> orange purple
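
If ties matter in your data, one possible rule is to report every tied value rather than pick one. Below is a minimal base R sketch of that idea, not part of the dplyr approach above. The helper name extreme_values and the "/" separator are my own choices for illustration, and it assumes the two columns are character vectors as in the sample data.

```r
# Sample data from the question.
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 3),
  A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
  B = c("green", "yellow", NA, "blue", "red", "purple", "orange", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every value tied for the extreme count,
# collapsed into one string, so ties stay visible instead of being dropped.
extreme_values <- function(values, pick = max) {
  counts <- table(values)  # table() ignores NA by default
  if (length(counts) == 0) return(NA_character_)  # group with only NAs
  winners <- names(counts)[counts == pick(counts)]
  paste(sort(winners), collapse = "/")
}

# Pool columns A and B within each group.
pooled <- split(c(df$A, df$B), rep(df$group, times = 2))
most  <- vapply(pooled, extreme_values, character(1), pick = max)
least <- vapply(pooled, extreme_values, character(1), pick = min)

# Look up each row's group result and store it as a new column.
df$most  <- unname(most[as.character(df$group)])
df$least <- unname(least[as.character(df$group)])
df
```

On the sample data there are no ties, so this should give the same most and least columns as the dplyr version. If a group had two values with the same top count, both would appear, for example "blue/red". A group with only one distinct value gets that value as both most and least.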
