ChrisYee90

Reputation: 429

Splitting a data frame into equal parts

I have an example data frame:

df <- data.frame(x = 1:112, y = runif(112))

Is there a way to print a list of data frames with the first part of the list containing rows 1:10, the second 11:20, etc. up until the end (111:112)?

Upvotes: 32

Views: 43261

Answers (4)

GKi

Reputation: 39717

Another way is to use split in combination with gl.

n <- 10
nr <- nrow(df)
split(df, gl(ceiling(nr/n), n, nr))

gl creates a factor that split can use directly as the grouping.
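To see what that factor looks like, here is a small sketch on the 112-row data frame from the question. The names g and chunks are introduced here only for illustration.

```r
# Rebuild the example data frame from the question
df <- data.frame(x = 1:112, y = runif(112))

n  <- 10          # rows per chunk
nr <- nrow(df)    # total number of rows

# gl(number_of_levels, repeats_per_level, total_length):
# each level is repeated n times and the result is cut off at nr elements
g <- gl(ceiling(nr / n), n, nr)

nlevels(g)        # number of chunks
table(g)          # rows per chunk; the last level holds the remainder

chunks <- split(df, g)
sapply(chunks, nrow)
```

Because the factor is truncated to nrow(df), the last chunk simply contains whatever rows are left over.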


Benchmark

n <- 1e5
df <- data.frame(x = 1:n, y = runif(n))
bench::mark(
"Rich Scriven" = {n <- 10
  nr <- nrow(df)
  split(df, rep(1:ceiling(nr/n), each=n, length.out=nr))},
GKi = {n <- 10
  nr <- nrow(df)
  split(df, gl(ceiling(nr/n), n, nr))}
)
#  expression        min   median `itr/sec` mem_alloc gc/se…¹ n_itr  n_gc total…²
#  <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>   <dbl> <int> <dbl> <bch:t>
#1 Rich Scriven    411ms    444ms      2.25    3.54MB    13.5     2    12   889ms
#2 GKi             412ms    423ms      2.37    2.03MB    15.4     2    13   845ms

In this benchmark, using gl instead of rep is marginally faster and allocates less memory.

Upvotes: 3

krlmlr

Reputation: 25484

Based on Rich's answer, here is a variant that avoids instantiating copies of the split data. Instead, a callback is called with each chunk. The chunk size can be specified either as a number of rows or as a number of cells.

split_df <- function(x, ..., size_cells = NULL, size_rows = NULL, callback) {
  stopifnot(is.function(callback))

  if (is.null(size_rows)) {
    size_rows <- max(floor(size_cells / ncol(x)), 1)
  }

  n_rows <- nrow(x)
  n_chunks <- ceiling(n_rows / size_rows)

  idx <- rep(seq.int(n_chunks), each = size_rows, length.out = n_rows)
  split <- split(seq_len(n_rows), idx)
  lapply(split, function(i) {
    callback(x[i, , drop = FALSE])
    NULL
  })
  invisible()
}

# 30 cells = 3 rows
split_df(palmerpenguins::penguins[1:10, ], size_cells = 30, callback = print)
#> # A tibble: 3 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
#>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
#> 1 Adelie  Torge…           39.1          18.7              181        3750 male 
#> 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
#> 3 Adelie  Torge…           40.3          18                195        3250 fema…
#> # … with 1 more variable: year <int>
#> # A tibble: 3 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
#>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
#> 1 Adelie  Torge…           NA            NA                 NA          NA <NA> 
#> 2 Adelie  Torge…           36.7          19.3              193        3450 fema…
#> 3 Adelie  Torge…           39.3          20.6              190        3650 male 
#> # … with 1 more variable: year <int>
#> # A tibble: 3 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
#>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
#> 1 Adelie  Torge…           38.9          17.8              181        3625 fema…
#> 2 Adelie  Torge…           39.2          19.6              195        4675 male 
#> 3 Adelie  Torge…           34.1          18.1              193        3475 <NA> 
#> # … with 1 more variable: year <int>
#> # A tibble: 1 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
#>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
#> 1 Adelie  Torge…             42          20.2              190        4250 <NA> 
#> # … with 1 more variable: year <int>

# Specify number of rows instead
split_df(palmerpenguins::penguins[1:3, ], size_rows = 2, callback = print)
#> # A tibble: 2 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
#>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
#> 1 Adelie  Torge…           39.1          18.7              181        3750 male 
#> 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
#> # … with 1 more variable: year <int>
#> # A tibble: 1 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
#>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
#> 1 Adelie  Torge…           40.3            18              195        3250 fema…
#> # … with 1 more variable: year <int>

Created on 2021-12-18 by the reprex package (v2.0.1)

Upvotes: 2

Holger Brandl

Reputation: 11232

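This can be solved with nesting using tidyr/dplyr. Note that num_groups sets the number of parts, not the number of rows per part.

library(dplyr)
library(tidyr)

num_groups <- 10  # number of parts to split into

iris %>%
  # integer-divide the zero-based row index by the part size to get a group id
  group_by((row_number() - 1) %/% (n() / num_groups)) %>%
  nest() %>%
  pull(data)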
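For comparison, the same "fixed number of parts" grouping can be sketched in base R, which makes it easier to see how the group id is computed. The names part and parts are illustrative, and this is a sketch of the idea rather than a drop-in replacement for the pipeline.

```r
num_groups <- 10                     # number of parts wanted
nr <- nrow(iris)                     # iris ships with R and has 150 rows

# Zero-based row index, integer-divided by the (possibly fractional) part size
part <- (seq_len(nr) - 1) %/% (nr / num_groups)

parts <- split(iris, part)

length(parts)                        # number of parts
sapply(parts, nrow)                  # rows in each part
```

With 150 rows and 10 parts, every part gets 15 rows. When the row count is not a multiple of num_groups, the part sizes differ slightly.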

Upvotes: 19

Rich Scriven

Reputation: 99371

You could use split(), with rep() to create the groupings.

n <- 10
nr <- nrow(df)
split(df, rep(1:ceiling(nr/n), each=n, length.out=nr))
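If this is needed in several places, the idea can be wrapped in a small helper. The function name split_rows below is hypothetical, not part of base R, and the sketch assumes n is a positive whole number.

```r
# Hypothetical helper: split a data frame into chunks of n rows each;
# the last chunk holds the remaining rows.
split_rows <- function(data, n) {
  nr <- nrow(data)
  # 1,1,...,1,2,2,...,2,... with each id repeated n times, cut off at nr
  grp <- rep(seq_len(ceiling(nr / n)), each = n, length.out = nr)
  split(data, grp)
}

df <- data.frame(x = 1:112, y = runif(112))
chunks <- split_rows(df, 10)

length(chunks)                 # number of chunks
range(chunks[[2]]$x)           # first and last x value in the second chunk
chunks[[length(chunks)]]$x     # x values in the final, shorter chunk
```

The list elements are named after the group ids, so chunks[["3"]] and chunks[[3]] refer to the same chunk here.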

Upvotes: 57
