Emman
Emman

Reputation: 4201

In grouped data, how to generate a new column that accounts for group size each row belongs to, using R package {collapse}?

In a grouped data frame, I want to generate a new column for the size (number of observations) of each group.

Whereas this is easy enough with dplyr functions, I struggle to find how to do it using tools from collapse package.

Let's take the mpg dataset from ggplot2. Once we group the data by manufacturer, we then want to compute an additional column that accounts for whether each row belongs to a group with of size > 10. With dplyr it's as easy as mutate(more_than_10_obs = n() > 10).

library(ggplot2) ## for the `mpg` dataset
library(dplyr, warn.conflicts = FALSE)   

mpg %>%
  select(manufacturer) %>%
  group_by(manufacturer) %>%
  mutate(more_than_10_obs = n() > 10)
#> # A tibble: 234 x 2
#> # Groups:   manufacturer [15]
#>    manufacturer more_than_10_obs
#>    <chr>        <lgl>           
#>  1 audi         TRUE            
#>  2 audi         TRUE            
#>  3 audi         TRUE            
#>  4 audi         TRUE            
#>  5 audi         TRUE            
#>  6 audi         TRUE            
#>  7 audi         TRUE            
#>  8 audi         TRUE            
#>  9 audi         TRUE            
#> 10 audi         TRUE            
#> # ... with 224 more rows

Created on 2021-08-24 by the reprex package (v2.0.0)

But I have very large data and I want to speed up processing time, therefore use tools from collapse package. So following this answer, I've tried to first generate a new column that simply gives the number of observations in the group that each row belongs to.

library(collapse)
library(ggplot2)
library(magrittr) ## for pipe operator

mpg %>%
  fselect(manufacturer) %>%
  fgroup_by(manufacturer) %>%
  ftransform(idx = seq_row(.)) %>%
  ftransform(n_in_group = fNobs(idx))
#> # A tibble: 234 x 3
#>    manufacturer   idx n_in_group
#>  * <chr>        <int>      <int>
#>  1 audi             1        234
#>  2 audi             2        234
#>  3 audi             3        234
#>  4 audi             4        234
#>  5 audi             5        234
#>  6 audi             6        234
#>  7 audi             7        234
#>  8 audi             8        234
#>  9 audi             9        234
#> 10 audi            10        234
#> # ... with 224 more rows
#> 
#> Grouped by:  manufacturer  [15 | 16 (11)]

Created on 2021-08-24 by the reprex package (v2.0.0)


My attempt is unsuccessful because column n_in_group gives the total number of rows in mpg, rather than the size of each group.

How can I utilize collapse tools to achieve the same output as in my dplyr-based code shown at the beginning?

Upvotes: 1

Views: 82

Answers (1)

lroha
lroha

Reputation: 34501

An equivalent method is:

library(collapse)
library(dplyr)

mpg %>%
  fselect(manufacturer) %>%
  ftransform(more_than_10_obs = fnobs(manufacturer, manufacturer, TRA = "replace") > 10)

# A tibble: 234 x 2
   manufacturer more_than_10_obs
 * <chr>        <lgl>           
 1 audi         TRUE            
 2 audi         TRUE            
 3 audi         TRUE            
 4 audi         TRUE            
 5 audi         TRUE            
 6 audi         TRUE            
 7 audi         TRUE            
 8 audi         TRUE            
 9 audi         TRUE            
10 audi         TRUE            
# ... with 224 more rows

Check that it gives the same result as dplyr:

collapse_res <- mpg %>%
  fselect(manufacturer) %>%
  ftransform(more_than_10_obs = fnobs(manufacturer, manufacturer, TRA = "replace") > 10)

dplyr_res <- mpg %>%
  select(manufacturer) %>%
  group_by(manufacturer) %>%
  mutate(more_than_10_obs  = n() > 10) %>%
  ungroup()

identical(dplyr_res, collapse_res)

[1] TRUE

Edit:

To group by multiple variables, simply pass them as a list:

mpg %>%
  fselect(manufacturer, class) %>%
  ftransform(more_than_10_obs = fnobs(manufacturer, list(manufacturer, class), TRA = "replace") > 10)

Upvotes: 1

Related Questions