Reputation: 4201
In a grouped data frame, I want to generate a new column for the size (number of observations) of each group.
Whereas this is easy enough with dplyr
functions, I struggle to find how to do it using tools from collapse
package.
Let's take the mpg
dataset from ggplot2
. Once we group the data by manufacturer
, we then want to compute an additional column that accounts for whether each row belongs to a group with of size > 10. With dplyr
it's as easy as mutate(more_than_10_obs = n() > 10)
.
library(ggplot2) ## for the `mpg` dataset
library(dplyr, warn.conflicts = FALSE)
mpg %>%
select(manufacturer) %>%
group_by(manufacturer) %>%
mutate(more_than_10_obs = n() > 10)
#> # A tibble: 234 x 2
#> # Groups: manufacturer [15]
#> manufacturer more_than_10_obs
#> <chr> <lgl>
#> 1 audi TRUE
#> 2 audi TRUE
#> 3 audi TRUE
#> 4 audi TRUE
#> 5 audi TRUE
#> 6 audi TRUE
#> 7 audi TRUE
#> 8 audi TRUE
#> 9 audi TRUE
#> 10 audi TRUE
#> # ... with 224 more rows
Created on 2021-08-24 by the reprex package (v2.0.0)
But I have very large data and I want to speed up processing time, therefore use tools from collapse
package. So following this answer, I've tried to first generate a new column that simply gives the number of observations in the group that each row belongs to.
library(collapse)
library(ggplot2)
library(magrittr) ## for pipe operator
mpg %>%
fselect(manufacturer) %>%
fgroup_by(manufacturer) %>%
ftransform(idx = seq_row(.)) %>%
ftransform(n_in_group = fNobs(idx))
#> # A tibble: 234 x 3
#> manufacturer idx n_in_group
#> * <chr> <int> <int>
#> 1 audi 1 234
#> 2 audi 2 234
#> 3 audi 3 234
#> 4 audi 4 234
#> 5 audi 5 234
#> 6 audi 6 234
#> 7 audi 7 234
#> 8 audi 8 234
#> 9 audi 9 234
#> 10 audi 10 234
#> # ... with 224 more rows
#>
#> Grouped by: manufacturer [15 | 16 (11)]
Created on 2021-08-24 by the reprex package (v2.0.0)
My attempt is unsuccessful because column n_in_group
gives the total number of rows in mpg
, rather than the size of each group.
How can I utilize collapse
tools to achieve the same output as in my dplyr
-based code shown at the beginning?
Upvotes: 1
Views: 82
Reputation: 34501
An equivalent method is:
library(collapse)
library(dplyr)
mpg %>%
fselect(manufacturer) %>%
ftransform(more_than_10_obs = fnobs(manufacturer, manufacturer, TRA = "replace") > 10)
# A tibble: 234 x 2
manufacturer more_than_10_obs
* <chr> <lgl>
1 audi TRUE
2 audi TRUE
3 audi TRUE
4 audi TRUE
5 audi TRUE
6 audi TRUE
7 audi TRUE
8 audi TRUE
9 audi TRUE
10 audi TRUE
# ... with 224 more rows
Check that it gives the same result as dplyr
:
collapse_res <- mpg %>%
fselect(manufacturer) %>%
ftransform(more_than_10_obs = fnobs(manufacturer, manufacturer, TRA = "replace") > 10)
dplyr_res <- mpg %>%
select(manufacturer) %>%
group_by(manufacturer) %>%
mutate(more_than_10_obs = n() > 10) %>%
ungroup()
identical(dplyr_res, collapse_res)
[1] TRUE
Edit:
To group by multiple variables, simply pass them as a list:
mpg %>%
fselect(manufacturer, class) %>%
ftransform(more_than_10_obs = fnobs(manufacturer, list(manufacturer, class), TRA = "replace") > 10)
Upvotes: 1