Reputation: 455
I am trying to find the most and least amount of items within a row / column group in a larger data frame. Here is the data to make it clearer:
df <- data.frame(matrix(nrow = 8, ncol = 3))
df$X1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
df$X2 <- c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange")
df$X3 <- c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA)
names(df) <- c("group", "A", "B")
Here is what that looks like (I have NAs in the original data, so I've included them):
group A B
1 1 yellow green
2 1 green yellow
3 1 yellow <NA>
4 2 blue blue
5 2 <NA> red
6 3 orange purple
7 3 <NA> orange
8 3 orange <NA>
In the first "group", for instance, I want to determine which color occurs the most and which color occurs the least. Something that looks like this:
group A B most least
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
I am working within a dplyr
chain in the original data so I can group_by
"group", but I am having a hard time figuring out a method that allows me to work within a "cluster" of two columns with differing numbers of rows. I do not need this to be done with dplyr
, but I figured it might be easiest given the usefulness of group_by
. Additionally, I need the result to somehow remain in the original data frame as new columns. Any suggestions?
Upvotes: 0
Views: 228
Reputation: 43334
Two options:
summarise
(or count
) to aggregate, subsetting the which.max
/which.min
:library(tidyverse)
df <- data_frame(group = c(1, 1, 1, 2, 2, 3, 3, 3),
A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
B = c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA))
df %>%
gather(var, color, A:B) %>%
drop_na(color) %>%
group_by(group, color) %>%
summarise(n = n()) %>%
summarise(most = color[which.max(n)],
least = color[which.min(n)]) %>%
left_join(df, .)
#> Joining, by = "group"
#> # A tibble: 8 x 5
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
Sort a table of values and subset it:
df %>%
group_by(group) %>%
mutate(most = last(names(sort(table(c(A, B))))),
least = first(names(sort(table(c(A, B))))))
#> # A tibble: 8 x 5
#> # Groups: group [3]
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
Upvotes: 2
Reputation: 39154
A solution uses dplyr
and tidyr
. The strategy is to find the "most" and "least" item and prepare a new data frame. After that, use the right_join
to merge the original data frame and prepare the desired output.
Notice that during the process I used slice
to subset the data frame to get the most and least item. This guarantees that there will be only one "most" and one "least" for each group. Nevertheless, it is possible that there could be a tie for each group. If that happens, you may want to think about what could be a good rule to determine which one is the "most" or which one is the "least".
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, na.rm = TRUE) %>%
count(group, Value) %>%
arrange(group, desc(n)) %>%
group_by(group) %>%
slice(c(1, n())) %>%
mutate(Type = c("most", "least")) %>%
select(-n) %>%
spread(Type, Value) %>%
right_join(df, by = "group") %>%
select(c(colnames(df), "most", "least"))
df2
# A tibble: 8 x 5
group A B most least
<dbl> <chr> <chr> <chr> <chr>
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
Upvotes: 3