codeforfun

Reputation: 187

Split dataframe into certain number of groups in R

I have a dataframe with 285000 records and I want to split it into 10 dataframes that I can save and access easily. I am trying to split it like this, but I am not sure how to save all the dataframes separately:

groups <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10")
X <- split(data5, f = groups)

Like this I only receive one subset dataframe.

Upvotes: 0

Views: 1356

Answers (2)

Serkan

Reputation: 1955

If you want to split your data and save the pieces separately, I would recommend the following approach using tidyverse.

Split the data

# libraries;
library(tidyverse)
library(data.table)

# split data according to some
# variable and store

data_list <- mtcars %>% split(
        f = .$cyl
) %>% set_names(
        nm = paste("cylinder", names(.), sep = "")
)

Here, f = .$cyl refers to the grouping variable in the dataset of interest. In this example I've split the data according to cyl in mtcars.

The function splits according to each level inside the data. In this case 4, 6 and 8 cylinders.

I proceed with set_names from purrr to name each element of the list accordingly.
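To see what split returns before any renaming, a quick base R check on the same mtcars data might look like the sketch below. The object names by_cyl and renamed are only illustrative, and setNames is used here as a base R stand-in for set_names.

```r
# split() returns a named list with one data frame per level of the grouping variable
by_cyl <- split(mtcars, mtcars$cyl)

names(by_cyl)          # "4" "6" "8"
sapply(by_cyl, nrow)   # rows per group: 11, 7 and 14

# base R equivalent of the set_names() step: prefix each list name
renamed <- setNames(by_cyl, paste0("cylinder", names(by_cyl)))
names(renamed)         # "cylinder4" "cylinder6" "cylinder8"

# each element is an ordinary data frame and can be pulled out by name
head(renamed[["cylinder6"]])
```

Accessing the pieces by name like this is usually enough when the data does not need to leave the R session.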

Saving the data

# store and save locally
# by using map

map(
        .x = 1:length(data_list),
        .f = function(i) {
                
                # set name of data to save locally
                path <- paste(names(data_list[i]), ".csv", sep = "")
                
                # save with fwrite
                fwrite(
                        data_list[[i]],
                        file = path,
                        sep  = ";"
                )
                
                
        }
)

I use map to iterate through the entire length of the list which split creates, and save them locally according to the names we set above with fwrite from data.table for better performance.

Note that in the script each data frame is saved as paste(names(data_list[i]), ".csv", sep = ""), which evaluates to cylinder4.csv, cylinder6.csv and cylinder8.csv.

The same approach to your data should be readily applicable with minor changes in the script.
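The data in the question has no ready-made grouping column, so one would have to be built first. The following is a minimal sketch rather than a definitive solution: data5 is replaced by a small made-up stand-in, the files go to a temporary directory, and base R's write.csv and read.csv are used in place of fwrite. The names x1 to x10 are taken from the question.

```r
# stand-in for the asker's data5 (hypothetical columns; the real one has 285000 rows)
set.seed(42)
data5 <- data.frame(id = seq_len(1000), value = rnorm(1000))

# cut the row index into 10 consecutive blocks labelled x1 ... x10
group_id <- cut(seq_len(nrow(data5)), breaks = 10, labels = paste0("x", 1:10))

# one data frame per block, stored in a named list
pieces <- split(data5, group_id)

# write each piece to its own file (tempdir() is only an assumption for the sketch)
out_dir <- tempdir()
for (nm in names(pieces)) {
  write.csv(pieces[[nm]], file.path(out_dir, paste0(nm, ".csv")), row.names = FALSE)
}

# read a single piece back when it is needed
x3 <- read.csv(file.path(out_dir, "x3.csv"))
```

Keeping the pieces in one named list and looping over names(pieces) avoids creating ten separate variables by hand.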

Best

Upvotes: 2

Bill O'Brien

Reputation: 882

If you want to arbitrarily split a big dataframe into little ones, you can add to the dataframe a uniformly distributed grouping variable, then use split.

df <- data.frame(group = rep(1:3, 4),
                 val = runif(12))

df

   group       val
1      1 0.5883321
2      2 0.5704967
3      3 0.7866597
4      1 0.8685778
5      2 0.6580090
6      3 0.1036386
7      1 0.7858867
8      2 0.2679281
9      3 0.2577965
10     1 0.6040585
11     2 0.6987716
12     3 0.2328914

split(df, df$group)

$`1`
   group       val
1      1 0.5883321
4      1 0.8685778
7      1 0.7858867
10     1 0.6040585

$`2`
   group       val
2      2 0.5704967
5      2 0.6580090
8      2 0.2679281
11     2 0.6987716

$`3`
   group       val
3      3 0.7866597
6      3 0.1036386
9      3 0.2577965
12     3 0.2328914
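The same idea can be stretched to any number of rows and groups. Below is a rough sketch under a few assumptions: big is a made-up stand-in data frame, 10 groups are wanted as in the question, and rep with length.out is used so the row count does not have to be a multiple of the group count.

```r
set.seed(1)                          # only to make the random variant repeatable
big <- data.frame(val = runif(25))   # stand-in; any number of rows works
n_groups <- 10

# deterministic: cycle 1..10 down the rows, the way rep(1:3, 4) does above
big$group <- rep(seq_len(n_groups), length.out = nrow(big))
table(big$group)                     # group sizes differ by at most one row

# random alternative: shuffle the same labels before attaching them
big$group_random <- sample(big$group)

pieces <- split(big, big$group)
length(pieces)                       # 10
```

Splitting on group_random instead of group gives pieces of the same sizes but with rows assigned at random.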

Upvotes: 0
