Reputation: 1582
Creating a sample dataset to reproduced the problem
library(dplyr)
x <- c('MS','Google','MS','FB','Amazon','Google','IBM','IBM','IBM','MS')
item <- as.data.frame(x,stringsAsFactors = F)
data <- item %>% group_by(x) %>% summarise(n = n())
# A tibble: 5 x 2
x n
<chr> <int>
1 Amazon 1
2 FB 1
3 Google 2
4 IBM 3
5 MS 3
Now my intent to create a dataset where all the 'n' count less than 2 should be summarize in a row called 'other' meanwhile it also sum the n count like
x n
<chr> <int>
1 Other 2
2 Google 2
3 IBM 3
4 MS 3
I am able to achieve it by below mentioned code but I am sure its not good way to do this, Please suggest me if I can do same directly by dplyr query
data$x[data$n < 2]= 'Other'
data <- aggregate(n~x, data, FUN = sum)
Upvotes: 2
Views: 581
Reputation: 887951
We could also use case_when
within group_by
to change 'x' values to 'Other' where the 'n' is 1 and then do the sum
of 'n' in summarise
library(dplyr)
data %>%
group_by(x = case_when(n ==1 ~ 'Other',
TRUE ~ x)) %>%
summarise(n = sum(n))
# A tibble: 4 x 2
# x n
# <chr> <int>
#1 Google 2
#2 IBM 3
#3 MS 3
#4 Other 2
Upvotes: 2
Reputation: 20095
Another option could be achieved using bind_rows
and filter
as:
library(dplyr)
x <- c('MS','Google','MS','FB','Amazon','Google','IBM','IBM','IBM','MS')
item <- as.data.frame(x,stringsAsFactors = F)
data <- item %>% group_by(x) %>% summarise(n = n())
data %>% {
bind_rows(filter(., n >= 2),
filter(., n < 2) %>% summarise(x = "Other", n = sum(n))
)
}
# x n
# <chr> <int>
#1 Google 2
#2 IBM 3
#3 MS 3
#4 Other 2
Upvotes: 1
Reputation: 51612
Here is an idea via dplyr
,
library(dplyr)
data %>%
mutate(grp = cumsum(c(1, diff(n < 2) != 0)),
grp = replace(grp, n >=2, grp[n >= 2] + row_number()[n >= 2])) %>%
group_by(grp) %>%
summarise(x = toString(x), n = sum(n)) %>%
ungroup() %>%
select(-grp)
which gives,
# A tibble: 4 x 2 x n <chr> <int> 1 Amazon, FB 2 2 Google 2 3 IBM 3 4 MS 3
NOTE: If you really want to use Other
then, at the end of the pipe, add the following,
... %>% mutate(x = replace(x, grepl(',', x), 'Other'))
To 'decipher' the cumsum
part for the grouping lets break it down.
We want to create groups where all values within that group are less than 2. However, inevitably, we also create groups for values greater (or equal) than 2. In order to avoid summarising those groups, we replace them by adding an incremental value on them. This will ensure that groups with values greater than 2 will only have one element in, thus ensuring they won't get summarised at the end as well.
The trick to get the groups is to create a logical vector with values less than 2, and take the difference to find when it changes from TRUE to FALSE (hence the ...!= 0
part). Since the diff
will remove a value, we add it manually by c(1, diff(...))
. Note that instead of 1 you could put TRUE
. The cumsum
then creates the groups. In order to avoid summarising the groups with values > 2, we replace them by adding their row_number to them. Why row_number? Because it increases thus making all groups unique.
x <- c(1, 1, 3, 4, 2, 1, 1, 1, 5)
x < 2
#[1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
diff(x < 2) != 0
#[1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
cumsum(c(1, diff(x < 2) != 0))
#[1] 1 1 2 2 2 3 3 3 4
Upvotes: 4