How to split a list of data frames based on its column names?

I have a list of more than 600 data frames that do not all share the same structure (column names, column order, and variable types). I need to identify which of those data frames lack the desired structure and modify them so that I can work with all the data for different purposes (summaries, analyses, etc.).

I am trying to create two lists from the main one, based on the desired column names and their order. To do that, I tried the following:

# some random dfs for the example
v1 <- c(1:15)
v2 <- c(20:34)
v3 <- c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o")
v3b <- c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o")

df1 <- data.frame(v1, v2, v3)
df2 <- data.frame(v1, v2, v3)
df3 <- data.frame(v1, v2, v3b)

mylist <- list(df1, df2, df3)

names <- colnames(mylist[[1]]) #remember I have over 600 dfs in the original list
listA <- list()
listB <- list()

#I suppose this piece of code should work    
colnames(mylist[[1]]) == names
colnames(mylist[[2]]) == names
colnames(mylist[[3]]) == names

for (k in 1:length(mylist)){
  if(colnames(mylist[[k]]) == names){
    listA[[k]] <- mylist[[k]]
  }else{
    listB[[k]] <- mylist[[k]]
  }
}

The problem is that the loop puts all of the data frames into the first list and leaves the second list empty. It also generates the following warning:

1: In if (colnames(mylist[[k]]) == names) { : the condition has length > 1 and only the first element will be used

I have read and searched a lot on Stack Overflow to solve this problem, but I feel helpless...

Does anybody know what's wrong with the code? More importantly, is this an appropriate way to split my list of data frames based on the column names, or are there better ones?

Upvotes: 1

Answers (4)

meriops

Reputation: 1037

Here's a tidyverse solution, using mylist and names as you defined them:

library(tidyverse)

listA <-
 mylist %>%
 keep(~ all(names(.) %in% names))

listB <-
 mylist %>%
 discard(~ all(names(.) %in% names))
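One caveat about the predicate above: `%in%` only tests membership, so it does not check column order, which the question says matters. A minimal base R sketch of the difference follows; `target` and `reordered` are hypothetical names introduced here for illustration.

```r
# Hypothetical target structure, matching the example in the question
target <- c("v1", "v2", "v3")

# A data frame with the right columns in the wrong order
reordered <- data.frame(v3 = "a", v1 = 1L, v2 = 20L)

# Membership test: passes even though the order differs
membership_ok <- all(names(reordered) %in% target)

# Exact test: requires the same names in the same order
exact_ok <- identical(names(reordered), target)

print(membership_ok)  # TRUE
print(exact_ok)       # FALSE
```

If order is part of the desired structure, swapping `identical()` into the `keep()` and `discard()` calls is one way to make the split stricter.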

Upvotes: 0

jay.sf

Reputation: 72593

Create groups that you get by matching the names with match(), then use split().

f <- sapply(mylist, function(x) length(na.omit(match(names(x), names))))
listNew <- setNames(split(mylist, f), c("listB", "listA"))

Yielding

> str(listNew)
List of 2
 $ listB:List of 1
  ..$ :'data.frame':    15 obs. of  3 variables:
  .. ..$ v1 : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ v2 : int [1:15] 20 21 22 23 24 25 26 27 28 29 ...
  .. ..$ v3b: Factor w/ 15 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ listA:List of 2
  ..$ :'data.frame':    15 obs. of  3 variables:
  .. ..$ v1: int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ v2: int [1:15] 20 21 22 23 24 25 26 27 28 29 ...
  .. ..$ v3: Factor w/ 15 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
  ..$ :'data.frame':    15 obs. of  3 variables:
  .. ..$ v1: int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ v2: int [1:15] 20 21 22 23 24 25 26 27 28 29 ...
  .. ..$ v3: Factor w/ 15 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
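With 600 data frames, the number of matched names may take more than two distinct values, in which case `setNames()` with exactly two names no longer fits. A possible variation, sketched under the assumption that only a full match should count as conforming, groups on a two-level factor instead. The names `target`, `dfs`, `n_matched`, and `is_full` are hypothetical.

```r
target <- c("v1", "v2", "v3")

# Three small data frames with 3, 2, and 1 matching column names
dfs <- list(
  data.frame(v1 = 1, v2 = 2, v3 = "a"),
  data.frame(v1 = 1, v2 = 2, v3b = "a"),
  data.frame(v1 = 1, x = 2, y = "a")
)

# Count how many column names of each data frame appear in the target
n_matched <- sapply(dfs, function(x) length(na.omit(match(names(x), target))))

# Collapse the counts to a yes/no grouping; fixing the factor levels
# guarantees both groups exist and keeps their order stable
is_full <- n_matched == length(target)
groups <- split(dfs, factor(is_full, levels = c(FALSE, TRUE),
                            labels = c("listB", "listA")))

print(lengths(groups))
```

Like the original, this counts matches and so ignores column order and extra columns.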

Upvotes: 1

moodymudskipper

Reputation: 47300

You can correct your approach by using identical() instead of ==. The index k should also be replaced if you don't want NULL elements:

for (k in seq_along(mylist)){
  if(identical(colnames(mylist[[k]]), names)){
    listA[[length(listA)+1]] <- mylist[[k]]
  }else{
    listB[[length(listB)+1]] <- mylist[[k]]
  }
}
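To see why `==` misbehaves inside `if()`, here is a minimal sketch using the column names from the question (`target` and `other` are hypothetical variable names). `==` compares element by element, and the original `if()` only looked at the first result, `"v1" == "v1"`, which is TRUE for every data frame. That is why everything landed in listA. In R 4.2.0 and later, a condition of length greater than one is an error rather than a warning.

```r
target <- c("v1", "v2", "v3")
other  <- c("v1", "v2", "v3b")

# `==` returns one logical per column name, not a single answer
elementwise <- other == target
print(elementwise)

# identical() collapses the comparison to a single TRUE/FALSE,
# which is what if() expects
same_as_target <- identical(other, target)
print(same_as_target)
```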

I'd rather use split(), however. Here's a suggestion:

split(mylist,sapply(mylist,function(x) identical(colnames(x),names)))

$`FALSE`
$`FALSE`[[1]]
   v1 v2 v3b
1   1 20   a
2   2 21   b
3   3 22   c
4   4 23   d
5   5 24   e
6   6 25   f
7   7 26   g
8   8 27   h
9   9 28   i
10 10 29   j
11 11 30   k
12 12 31   l
13 13 32   m
14 14 33   n
15 15 34   o


$`TRUE`
$`TRUE`[[1]]
   v1 v2 v3
1   1 20  a
2   2 21  b
3   3 22  c
4   4 23  d
5   5 24  e
6   6 25  f
7   7 26  g
8   8 27  h
9   9 28  i
10 10 29  j
11 11 30  k
12 12 31  l
13 13 32  m
14 14 33  n
15 15 34  o

$`TRUE`[[2]]
   v1 v2 v3
1   1 20  a
2   2 21  b
3   3 22  c
4   4 23  d
5   5 24  e
6   6 25  f
7   7 26  g
8   8 27  h
9   9 28  i
10 10 29  j
11 11 30  k
12 12 31  l
13 13 32  m
14 14 33  n
15 15 34  o

Upvotes: 1

Rui Barradas

Reputation: 76402

If I understand what you want correctly, the following code separates the original list into two lists:

listA has all dataframes with names equal to the names of mylist[[1]];
listB has all other dataframes.

It uses *apply functions instead of explicit for loops.

nms <- lapply(mylist, names)
inx <- sapply(nms[-1], function(nm) identical(nm, nms[[1]]))
inx <- c(TRUE, inx)
listA <- mylist[inx]
listB <- mylist[!inx]
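Since the goal is also to identify which data frames need fixing, the logical index can be reused to report their positions and how their columns differ. The following is only a sketch on the example data; `target`, `ok`, `bad_positions`, and `diffs` are hypothetical names.

```r
# Rebuild the example list from the question
v1 <- 1:15
v2 <- 20:34
v3 <- letters[1:15]
v3b <- letters[1:15]
mylist <- list(data.frame(v1, v2, v3),
               data.frame(v1, v2, v3),
               data.frame(v1, v2, v3b))

target <- names(mylist[[1]])

# One TRUE/FALSE per data frame: same names in the same order?
ok <- vapply(mylist, function(d) identical(names(d), target), logical(1))

# Positions of the nonconforming data frames in the original list
bad_positions <- which(!ok)

# For each nonconforming data frame, list unexpected and missing columns
diffs <- lapply(mylist[!ok], function(d) {
  list(unexpected = setdiff(names(d), target),
       missing    = setdiff(target, names(d)))
})
names(diffs) <- bad_positions

print(bad_positions)
print(diffs)
```

A data frame whose columns are merely reordered would show empty `unexpected` and `missing` entries, which distinguishes it from one with a wrong column name.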

Upvotes: 0
