crazysantaclaus

Reputation: 633

Split list based on rows of list items

I'm trying to split my list of data frames into subgroups, either as a nested list or as several separate lists. The split should be based on the number of rows per data frame, so data frames with the same number of rows should end up in the same list.

full_list <- list(
  df1 = replicate(10, sample(0:1, 10, replace = TRUE)),
  df2 = replicate(10, sample(0:1, 15, replace = TRUE)),
  df3 = replicate(10, sample(0:1, 20, replace = TRUE)),
  df4 = replicate(10, sample(0:1, 10, replace = TRUE))
)

There are now two data frames with nrow() == 10 (df1 and df4), so they should end up in their own list or sublist.

I tried something like this, but I don't think split is applicable for lists:

sublist <- lapply(full_list, function(x) split(full_list, f = nrow(x)))

BTW: The larger goal is to split every data frame into a training and a test data set for machine learning with the code below. sample.int will be used to create the subsets, but I want the same sample_vector for data frames with the same number of rows. Therefore, I want to split the full list into sub lists beforehand. Afterwards I will put all data frames together again for further processing (a kind of split-apply-combine). I mention this in case I am overcomplicating things here.

# loop that splits the data frames in each sub list into train and test data frames
counter <- 0
train_test_list <- list()
for (x_table in sublist) {
  counter <- counter + 1
  current_name <- names(sublist)[counter]

  sample_vector <- sample.int(n = nrow(x_table), 
    size = floor(0.8 * nrow(x_table)), replace = FALSE)
  train_set <- x_table[sample_vector, ]
  test_set  <- x_table[-sample_vector, ]

  train_test_list[[current_name]] <- list(
    train_set = train_set, test_set = test_set, 
    table_name = names(sublist)[counter]
  )
}
# combine all lists with test and train pairs back into one list 
full_train_test_list <- c(train_test_list1, train_test_list2, train_test_list3, ...)

Upvotes: 3

Views: 370

Answers (1)

akrun

Reputation: 887991

We can get the number of rows of each list element with sapply and split the list based on that:

new_list <- split(full_list, sapply(full_list, nrow))
str(new_list)
#List of 3
# $ 10:List of 2
#  ..$ df1: int [1:10, 1:10] 1 0 0 1 1 0 1 0 0 1 ...
#  ..$ df4: int [1:10, 1:10] 1 0 1 1 1 0 0 0 1 1 ...
# $ 15:List of 1
#  ..$ df2: int [1:15, 1:10] 0 1 1 0 0 0 0 0 0 1 ...
# $ 20:List of 1
#  ..$ df3: int [1:20, 1:10] 1 1 0 1 0 1 1 1 0 1 ...
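
A minimal sketch of the same idea, assuming only base R: split accepts a list as long as the grouping vector has one value per element. Here vapply stands in for sapply so that the row counts are guaranteed to be an integer vector, and lengths reports how many tables landed in each group. The placeholder matrices and the names row_counts and group_sizes are illustrative and not part of the original post.

```r
# Placeholder input in the same shape as full_list from the question
# (matrices here; nrow() behaves the same way on data frames).
full_list <- list(
  df1 = matrix(0L, nrow = 10, ncol = 10),
  df2 = matrix(0L, nrow = 15, ncol = 10),
  df3 = matrix(0L, nrow = 20, ncol = 10),
  df4 = matrix(0L, nrow = 10, ncol = 10)
)

# One grouping value per list element; vapply() enforces an integer result.
row_counts <- vapply(full_list, nrow, integer(1))

# split() groups the list elements by row count and keeps their names.
new_list <- split(full_list, row_counts)

# Number of tables per group, named by row count.
group_sizes <- lengths(new_list)
```

The group names are the row counts converted to character, so a single group can be picked out with new_list[["10"]].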

Because the result is a nested list, we can process the inner lists by calling lapply inside an outer lapply:

traintestlst <- lapply(new_list, function(sublst) {
  lapply(sublst, function(x_table) {
    sample_vector <- sample.int(n = nrow(x_table),
      size = floor(0.8 * nrow(x_table)), replace = FALSE)
    train_set <- x_table[sample_vector, ]
    test_set  <- x_table[-sample_vector, ]
    list(train_set = train_set, test_set = test_set)
  })
})

Checking the output:

traintestlst[[1]]$df1
#$train_set
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,]    1    1    0    1    0    0    1    1    1     0
#[2,]    1    0    1    1    1    0    0    0    1     0
#[3,]    0    1    0    0    1    1    0    1    1     0
#[4,]    1    1    0    1    0    0    1    0    0     1
#[5,]    0    0    0    1    0    0    1    0    1     0
#[6,]    0    1    1    0    1    0    1    0    1     0
#[7,]    1    0    1    1    0    0    0    0    0     1
#[8,]    0    1    0    0    0    1    0    0    1     0

#$test_set
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,]    0    0    0    0    0    1    0    1    0     1
#[2,]    1    0    0    0    0    0    0    1    1     0
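
The question asks for the same sample_vector within each group of equal row count, whereas the code above draws a new one for every table. One possible variation, as a sketch only: draw the vector once per sublist, reuse it for every table in that sublist, and then flatten the nested list back into a single list. The fixed seed, the extra train_rows element, and the name shared_split are assumptions made for illustration.

```r
set.seed(42)  # assumption: fixed seed only to make the sketch reproducible

full_list <- list(
  df1 = replicate(10, sample(0:1, 10, replace = TRUE)),
  df2 = replicate(10, sample(0:1, 15, replace = TRUE)),
  df3 = replicate(10, sample(0:1, 20, replace = TRUE)),
  df4 = replicate(10, sample(0:1, 10, replace = TRUE))
)
new_list <- split(full_list, sapply(full_list, nrow))

shared_split <- lapply(new_list, function(sublst) {
  # Every table in a group has the same row count, so draw the rows once.
  n_rows <- nrow(sublst[[1]])
  sample_vector <- sample.int(n = n_rows, size = floor(0.8 * n_rows),
                              replace = FALSE)
  lapply(sublst, function(x_table) {
    list(
      train_set  = x_table[sample_vector, , drop = FALSE],
      test_set   = x_table[-sample_vector, , drop = FALSE],
      train_rows = sample_vector  # hypothetical extra element, kept for checking
    )
  })
})

# Combine step: drop the grouping level to get one flat list keyed by
# table name, then restore the order of the original list.
full_train_test_list <- unlist(unname(shared_split), recursive = FALSE)
full_train_test_list <- full_train_test_list[names(full_list)]
```

With this layout, full_train_test_list$df1 and full_train_test_list$df4 are built from the same row indices, while df2 and df3 each get their own draw.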

Upvotes: 4
