fromtheloam
fromtheloam

Reputation: 455

Determining most/least amount of occurrences within subset row & column group in a data frame

I am trying to find the most and least amount of items within a row / column group in a larger data frame. Here is the data to make it clearer:

df <- data.frame(matrix(nrow = 8, ncol = 3))
df$X1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
df$X2 <- c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange") 
df$X3 <- c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA) 
names(df) <- c("group", "A", "B")

Here is what that looks like (I have NAs in the original data, so I've included them):

  group      A      B
1     1 yellow  green
2     1  green yellow
3     1 yellow   <NA>
4     2   blue   blue
5     2   <NA>    red
6     3 orange purple
7     3   <NA> orange
8     3 orange   <NA>

In the first "group", for instance, I want to determine which color occurs the most and which color occurs the least. Something that looks like this:

  group      A      B   most  least
1     1 yellow  green yellow  green
2     1  green yellow yellow  green
3     1 yellow   <NA> yellow  green
4     2   blue   blue   blue    red
5     2   <NA>    red   blue    red
6     3 orange purple orange purple
7     3   <NA> orange orange purple
8     3 orange   <NA> orange purple

I am working within a dplyr chain in the original data so I can group_by "group", but I am having a hard time figuring out a method that allows me to work within a "cluster" of two columns with differing numbers of rows. I do not need this to be done with dplyr, but I figured it might be easiest given the usefulness of group_by. Additionally, I need the result to somehow remain in the original data frame as new columns. Any suggestions?

Upvotes: 0

Views: 228

Answers (2)

alistaire
alistaire

Reputation: 43334

Two options:

  1. Reshape to long form and use summarise (or count) to aggregate, subsetting the which.max/which.min:
library(tidyverse)

df <- data_frame(group = c(1, 1, 1, 2, 2, 3, 3, 3),
                 A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
                 B = c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA))


df %>% 
    gather(var, color, A:B) %>% 
    drop_na(color) %>% 
    group_by(group, color) %>% 
    summarise(n = n()) %>% 
    summarise(most = color[which.max(n)], 
              least = color[which.min(n)]) %>% 
    left_join(df, .)
#> Joining, by = "group"
#> # A tibble: 8 x 5
#>   group      A      B   most  least
#>   <dbl>  <chr>  <chr>  <chr>  <chr>
#> 1     1 yellow  green yellow  green
#> 2     1  green yellow yellow  green
#> 3     1 yellow   <NA> yellow  green
#> 4     2   blue   blue   blue    red
#> 5     2   <NA>    red   blue    red
#> 6     3 orange purple orange purple
#> 7     3   <NA> orange orange purple
#> 8     3 orange   <NA> orange purple
  1. Sort a table of values and subset it:

    df %>% 
        group_by(group) %>% 
        mutate(most = last(names(sort(table(c(A, B))))),
               least = first(names(sort(table(c(A, B))))))
    #> # A tibble: 8 x 5
    #> # Groups:   group [3]
    #>   group      A      B   most  least
    #>   <dbl>  <chr>  <chr>  <chr>  <chr>
    #> 1     1 yellow  green yellow  green
    #> 2     1  green yellow yellow  green
    #> 3     1 yellow   <NA> yellow  green
    #> 4     2   blue   blue   blue    red
    #> 5     2   <NA>    red   blue    red
    #> 6     3 orange purple orange purple
    #> 7     3   <NA> orange orange purple
    #> 8     3 orange   <NA> orange purple
    

Upvotes: 2

www
www

Reputation: 39154

A solution uses dplyr and tidyr. The strategy is to find the "most" and "least" item and prepare a new data frame. After that, use the right_join to merge the original data frame and prepare the desired output.

Notice that during the process I used slice to subset the data frame to get the most and least item. This guarantees that there will be only one "most" and one "least" for each group. Nevertheless, it is possible that there could be a tie for each group. If that happens, you may want to think about what could be a good rule to determine which one is the "most" or which one is the "least".

library(dplyr)
library(tidyr)

df2 <- df %>%
  gather(Column, Value, -group, na.rm = TRUE) %>%
  count(group, Value) %>%
  arrange(group, desc(n)) %>%
  group_by(group) %>%
  slice(c(1, n())) %>%
  mutate(Type = c("most", "least")) %>%
  select(-n) %>%
  spread(Type, Value) %>%
  right_join(df, by = "group") %>%
  select(c(colnames(df), "most", "least"))
df2
# A tibble: 8 x 5
  group      A      B   most  least
  <dbl>  <chr>  <chr>  <chr>  <chr>
1     1 yellow  green yellow  green
2     1  green yellow yellow  green
3     1 yellow   <NA> yellow  green
4     2   blue   blue   blue    red
5     2   <NA>    red   blue    red
6     3 orange purple orange purple
7     3   <NA> orange orange purple
8     3 orange   <NA> orange purple

Upvotes: 3

Related Questions