Reputation: 1582

Summarize the row based on condition

Creating a sample dataset to reproduced the problem

library(dplyr)
x <- c('MS','Google','MS','FB','Amazon','Google','IBM','IBM','IBM','MS')
item <- as.data.frame(x,stringsAsFactors = F)
data <- item %>% group_by(x) %>% summarise(n = n())

# A tibble: 5 x 2
  x          n
  <chr>  <int>
1 Amazon     1
2 FB         1
3 Google     2
4 IBM        3
5 MS         3

Now my intent to create a dataset where all the 'n' count less than 2 should be summarize in a row called 'other' meanwhile it also sum the n count like

x          n
  <chr>  <int>
1 Other      2
2 Google     2
3 IBM        3
4 MS         3

I am able to achieve it by below mentioned code but I am sure its not good way to do this, Please suggest me if I can do same directly by dplyr query

data$x[data$n < 2]= 'Other'
data <- aggregate(n~x, data, FUN = sum)

Upvotes: 2

Answers (3)

akrun

Reputation: 887951

We could also use case_when within group_by to change 'x' values to 'Other' where the 'n' is 1 and then do the sum of 'n' in summarise

library(dplyr)
data %>% 
   group_by(x = case_when(n ==1 ~ 'Other', 
                              TRUE ~ x)) %>% 
   summarise(n = sum(n))
# A tibble: 4 x 2
#  x          n
#   <chr>  <int>
#1 Google     2
#2 IBM        3
#3 MS         3
#4 Other      2

Upvotes: 2

MKR

Reputation: 20095

Another option could be achieved using bind_rows and filter as:

library(dplyr)
x <- c('MS','Google','MS','FB','Amazon','Google','IBM','IBM','IBM','MS')
item <- as.data.frame(x,stringsAsFactors = F)
data <- item %>% group_by(x) %>% summarise(n = n())

data %>% {
  bind_rows(filter(., n >= 2), 
            filter(., n < 2) %>% summarise(x = "Other",  n = sum(n))
            )

}

#  x          n
#  <chr>  <int>
#1 Google     2
#2 IBM        3
#3 MS         3
#4 Other      2

Upvotes: 1

Sotos

Reputation: 51612

Here is an idea via dplyr,

library(dplyr)

data %>% 
 mutate(grp = cumsum(c(1, diff(n < 2) != 0)), 
        grp = replace(grp, n >=2, grp[n >= 2] + row_number()[n >= 2])) %>%
 group_by(grp) %>% 
 summarise(x = toString(x), n = sum(n)) %>% 
 ungroup() %>% 
 select(-grp)

which gives,

# A tibble: 4 x 2
  x              n
  <chr>      <int>
1 Amazon, FB     2
2 Google         2
3 IBM            3
4 MS             3

NOTE: If you really want to use Other then, at the end of the pipe, add the following,

... %>% mutate(x = replace(x, grepl(',', x), 'Other'))

To 'decipher' the cumsum part for the grouping lets break it down.

We want to create groups where all values within that group are less than 2. However, inevitably, we also create groups for values greater (or equal) than 2. In order to avoid summarising those groups, we replace them by adding an incremental value on them. This will ensure that groups with values greater than 2 will only have one element in, thus ensuring they won't get summarised at the end as well. The trick to get the groups is to create a logical vector with values less than 2, and take the difference to find when it changes from TRUE to FALSE (hence the ...!= 0 part). Since the diff will remove a value, we add it manually by c(1, diff(...)). Note that instead of 1 you could put TRUE. The cumsum then creates the groups. In order to avoid summarising the groups with values > 2, we replace them by adding their row_number to them. Why row_number? Because it increases thus making all groups unique.

x <- c(1, 1, 3, 4, 2, 1, 1, 1, 5)

x < 2
#[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
diff(x < 2) != 0
#[1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
cumsum(c(1, diff(x < 2) != 0))
#[1] 1 1 2 2 2 3 3 3 4

Upvotes: 4

Summarize the row based on condition

Answers (3)

Related Questions