Reputation: 429
I have an example data frame:
df <- data.frame(x = 1:112, y = runif(112))
Is there a way to print a list of data frames with the first part of the list containing rows 1:10, the second 11:20, etc. up until the end (111:112)?
Upvotes: 32
Views: 43261
Reputation: 39717
Another way, using split in combination with gl:
n <- 10
nr <- nrow(df)
split(df, gl(ceiling(nr/n), n, nr))
gl creates a factor that split can use directly.
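To see what that grouping factor looks like, here is a small sketch on a 7-row case with chunks of 3 (both sizes are chosen here only for illustration):

```r
n  <- 3   # rows per chunk (illustrative value)
nr <- 7   # total number of rows (illustrative value)

# gl(number_of_levels, repeats_per_level, total_length)
f <- gl(ceiling(nr / n), n, nr)

as.integer(f)  # 1 1 1 2 2 2 3
table(f)       # chunk sizes: 3, 3, 1
```

The third argument truncates the factor to the number of rows, which is why the last chunk can be shorter than the others.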
Benchmark
n <- 1e5
df <- data.frame(x = 1:n, y = runif(n))
bench::mark(
  "Rich Scriven" = {
    n <- 10
    nr <- nrow(df)
    split(df, rep(1:ceiling(nr/n), each=n, length.out=nr))
  },
  GKi = {
    n <- 10
    nr <- nrow(df)
    split(df, gl(ceiling(nr/n), n, nr))
  }
)
# expression min median `itr/sec` mem_alloc gc/se…¹ n_itr n_gc total…²
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:t>
#1 Rich Scriven 411ms 444ms 2.25 3.54MB 13.5 2 12 889ms
#2 GKi 412ms 423ms 2.37 2.03MB 15.4 2 13 845ms
Using gl instead of rep is marginally faster and uses less memory.
Upvotes: 3
Reputation: 25484
Based on Rich Scriven's answer, here is a variant that avoids instantiating copies of the split data all at once. Instead, a callback is called with each chunk. The desired chunk size can be specified either as a number of rows or as a number of cells.
split_df <- function(x, ..., size_cells = NULL, size_rows = NULL, callback) {
  stopifnot(is.function(callback))
  if (is.null(size_rows)) {
    size_rows <- max(floor(size_cells / ncol(x)), 1)
  }
  n_rows <- nrow(x)
  n_chunks <- ceiling(n_rows / size_rows)
  idx <- rep(seq.int(n_chunks), each = size_rows, length.out = n_rows)
  split <- split(seq_len(n_rows), idx)
  lapply(split, function(i) {
    callback(x[i, , drop = FALSE])
    NULL
  })
  invisible()
}
# 30 cells = 3 rows
split_df(palmerpenguins::penguins[1:10, ], size_cells = 30, callback = print)
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… 39.1 18.7 181 3750 male
#> 2 Adelie Torge… 39.5 17.4 186 3800 fema…
#> 3 Adelie Torge… 40.3 18 195 3250 fema…
#> # … with 1 more variable: year <int>
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… NA NA NA NA <NA>
#> 2 Adelie Torge… 36.7 19.3 193 3450 fema…
#> 3 Adelie Torge… 39.3 20.6 190 3650 male
#> # … with 1 more variable: year <int>
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… 38.9 17.8 181 3625 fema…
#> 2 Adelie Torge… 39.2 19.6 195 4675 male
#> 3 Adelie Torge… 34.1 18.1 193 3475 <NA>
#> # … with 1 more variable: year <int>
#> # A tibble: 1 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… 42 20.2 190 4250 <NA>
#> # … with 1 more variable: year <int>
# Specify number of rows instead
split_df(palmerpenguins::penguins[1:3, ], size_rows = 2, callback = print)
#> # A tibble: 2 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… 39.1 18.7 181 3750 male
#> 2 Adelie Torge… 39.5 17.4 186 3800 fema…
#> # … with 1 more variable: year <int>
#> # A tibble: 1 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Adelie Torge… 40.3 18 195 3250 fema…
#> # … with 1 more variable: year <int>
Created on 2021-12-18 by the reprex package (v2.0.1)
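The callback does not have to print. As a minimal sketch, the block below uses a condensed, row-only copy of the function (named split_rows here purely for illustration; it is not part of the answer above) and a callback that records the size of each chunk. The built-in mtcars data set stands in for real data.

```r
# Condensed, row-only copy of split_df(); split_rows is a hypothetical name
split_rows <- function(x, size_rows, callback) {
  stopifnot(is.function(callback))
  n_rows <- nrow(x)
  n_chunks <- ceiling(n_rows / size_rows)
  idx <- rep(seq_len(n_chunks), each = size_rows, length.out = n_rows)
  # Only row indices are split up front; each chunk is materialised on demand
  for (i in split(seq_len(n_rows), idx)) {
    callback(x[i, , drop = FALSE])
  }
  invisible()
}

# The callback appends each chunk's row count to a variable outside itself
chunk_sizes <- integer()
split_rows(mtcars, size_rows = 10, callback = function(chunk) {
  chunk_sizes <<- c(chunk_sizes, nrow(chunk))
})

chunk_sizes  # 10 10 10 2 (mtcars has 32 rows)
```

The same pattern could be used to write each chunk to a file or send it to a database, since only one chunk exists as a data frame at a time.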
Upvotes: 2
Reputation: 11232
This can be solved with nesting using tidyr/dplyr. Note that num_groups sets the number of chunks here, not the number of rows per chunk:
library(dplyr)
library(tidyr)
num_groups <- 10
iris %>%
  # zero-based row index, integer-divided by the rows per group
  group_by((row_number() - 1) %/% (n() / num_groups)) %>%
  nest() %>%
  pull(data)
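If a fixed number of rows per chunk is wanted instead, the same integer-division idea can be written in base R. This is a minimal sketch, not part of the answer above; chunk_size is a name introduced here for illustration, and df is the example data frame from the question.

```r
df <- data.frame(x = 1:112, y = runif(112))
chunk_size <- 10  # hypothetical name: rows per chunk

# Zero-based row index, integer-divided by the chunk size:
# rows 1-10 -> 0, rows 11-20 -> 1, ..., rows 111-112 -> 11
idx <- (seq_len(nrow(df)) - 1) %/% chunk_size
chunks <- split(df, idx)

length(chunks)       # 12
names(chunks)[1:3]   # "0" "1" "2"
```

Dividing by the chunk size fixes the rows per chunk, whereas dividing by n() / num_groups fixes the number of chunks.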
Upvotes: 19
Reputation: 99371
You could use split(), with rep() to create the groupings.
n <- 10
nr <- nrow(df)
split(df, rep(1:ceiling(nr/n), each=n, length.out=nr))
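As a quick check on the question's 112-row example, the sketch below stores the result and inspects the chunk sizes. The set.seed() call and the chunks variable are additions made here for illustration.

```r
set.seed(1)  # only to make the random column reproducible
df <- data.frame(x = 1:112, y = runif(112))

n  <- 10
nr <- nrow(df)
chunks <- split(df, rep(1:ceiling(nr / n), each = n, length.out = nr))

length(chunks)        # 12
sapply(chunks, nrow)  # eleven chunks of 10 rows, then one chunk of 2
chunks[[12]]$x        # 111 112
```

The length.out argument trims the repeated group labels to the number of rows, so the final chunk holds just the leftover rows 111 and 112.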
Upvotes: 57