Reputation: 93
I have a list of more than 600 data frames that don't all have exactly the same structure (column names, column order and variable types). I need to identify which of those data frames do not have the desired structure and modify them so I can work with all the data for different purposes (summaries, analyses, etc.).
I am trying to create two lists from the main one based on the desired names and order of the columns. For that I am trying to do the following:
# some random dfs for the example
v1 <- c(1:15)
v2 <- c(20:34)
v3 <- c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o")
v3b <- c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o")
df1 <- data.frame(v1, v2, v3)
df2 <- data.frame(v1, v2, v3)
df3 <- data.frame(v1, v2, v3b)
mylist <- list(df1, df2, df3)
names <- colnames(mylist[[1]]) #remember I have over 600 dfs in the original list
listA <- list()
listB <- list()
#I suppose this piece of code should work
colnames(mylist[[1]]) == names
colnames(mylist[[2]]) == names
colnames(mylist[[3]]) == names
for (k in 1:length(mylist)){
if(colnames(mylist[[k]]) == names){
listA[[k]] <- mylist[[k]]
}else{
listB[[k]] <- mylist[[k]]
}
}
Now the problem is that the loop with the conditional statements generates a list with all the data frames and a second empty list. It also generates the following warning:
1: In if (colnames(mylist[[k]]) == names) { : the condition has length > 1 and only the first element will be used
I have read and searched a lot on Stack Overflow to solve this problem, but I feel helpless...
Does anybody know what's wrong with the code? More importantly, is this an appropriate way to split my list of data frames based on the colnames, or are there better ones?
Upvotes: 1
Views: 717
Reputation: 1037
Here's a tidyverse solution, using mylist and names as you defined them:
library(tidyverse)
listA <-
  mylist %>%
  keep(~ all(names(.) %in% names))
listB <-
  mylist %>%
  discard(~ all(names(.) %in% names))
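Note that %in% only tests membership, so a data frame whose columns are in a different order (or that has only a subset of the columns) would still end up in listA. If order matters, a stricter variant might look like the sketch below. It uses base R's Filter() and Negate() so that it runs without extra packages; wanted and same_names are names introduced here for illustration, and df2 is deliberately reordered to show the difference.

```r
# Example data mirroring the question, with df2 reordered on purpose
df1 <- data.frame(v1 = 1:3, v2 = 4:6, v3 = letters[1:3])
df2 <- data.frame(v3 = letters[1:3], v1 = 1:3, v2 = 4:6)
df3 <- data.frame(v1 = 1:3, v2 = 4:6, v3b = letters[1:3])
mylist <- list(df1, df2, df3)

# Reference column names (called `wanted` to avoid masking base::names)
wanted <- names(mylist[[1]])

# TRUE only when the names match exactly and in the same order
same_names <- function(df) identical(names(df), wanted)

listA <- Filter(same_names, mylist)          # data frames with the desired layout
listB <- Filter(Negate(same_names), mylist)  # everything else
```

With this test only df1 goes to listA; the reordered df2 joins df3 in listB.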
Upvotes: 0
Reputation: 72593
Create groups by matching the names with match(), then use split().
f <- sapply(mylist, function(x) length(na.omit(match(names(x), names))))
listNew <- setNames(split(mylist, f), c("listB", "listA"))
Yielding
> str(listNew)
List of 2
$ listB:List of 1
..$ :'data.frame': 15 obs. of 3 variables:
.. ..$ v1 : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ v2 : int [1:15] 20 21 22 23 24 25 26 27 28 29 ...
.. ..$ v3b: Factor w/ 15 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ listA:List of 2
..$ :'data.frame': 15 obs. of 3 variables:
.. ..$ v1: int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ v2: int [1:15] 20 21 22 23 24 25 26 27 28 29 ...
.. ..$ v3: Factor w/ 15 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
..$ :'data.frame': 15 obs. of 3 variables:
.. ..$ v1: int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ v2: int [1:15] 20 21 22 23 24 25 26 27 28 29 ...
.. ..$ v3: Factor w/ 15 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
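The labels passed to setNames() above rely on there being exactly two distinct match counts and on the groups coming out in that order. A sketch of a variant that depends on neither, assuming an exact, ordered name match is what is wanted (wanted, ok and grp are illustrative names, and the small data frames stand in for the real ones):

```r
df1 <- data.frame(v1 = 1:3, v2 = 4:6, v3 = letters[1:3])
df2 <- df1
df3 <- data.frame(v1 = 1:3, v2 = 4:6, v3b = letters[1:3])
mylist <- list(df1, df2, df3)
wanted <- names(mylist[[1]])

# One logical per data frame: does it have exactly the wanted names?
ok <- vapply(mylist, function(x) identical(names(x), wanted), logical(1))

# Fixed levels and labels, so both groups always exist with stable names,
# even if one of them turns out to be empty
grp <- factor(ok, levels = c(TRUE, FALSE), labels = c("listA", "listB"))
listNew <- split(mylist, grp)
```

Because the factor levels are fixed, listNew$listB is simply an empty list when every data frame matches.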
Upvotes: 1
Reputation: 47300
You can correct your approach by using identical instead of ==. The k index should also be fixed if you don't want NULL elements:
for (k in 1:length(mylist)){
if(identical(colnames(mylist[[k]]), names)){
listA[[length(listA)+1]] <- mylist[[k]]
}else{
listB[[length(listB)+1]] <- mylist[[k]]
}
}
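To see why the original condition misbehaves, the sketch below compares the two tests on the example column names. == works element by element and returns one value per column, whereas if expects a single TRUE or FALSE; only the first element (v1 against v1, which is TRUE) was used, which is why every data frame ended up in listA. In R 4.2.0 and later a condition of length greater than one is an error rather than a warning. The vectors wanted and other are stand-ins for the column names in the question.

```r
wanted <- c("v1", "v2", "v3")
other  <- c("v1", "v2", "v3b")

# `==` is vectorised: one logical per column name
elementwise <- other == wanted

# identical() collapses the comparison to a single logical
same_all <- identical(other, wanted)

# all() also collapses to one value, but `==` recycles the shorter
# vector when the lengths differ, so identical() is the safer test
same_via_all <- all(other == wanted)
```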
I'd rather use split, however. Here's a suggestion:
split(mylist,sapply(mylist,function(x) identical(colnames(x),names)))
$`FALSE`
$`FALSE`[[1]]
v1 v2 v3b
1 1 20 a
2 2 21 b
3 3 22 c
4 4 23 d
5 5 24 e
6 6 25 f
7 7 26 g
8 8 27 h
9 9 28 i
10 10 29 j
11 11 30 k
12 12 31 l
13 13 32 m
14 14 33 n
15 15 34 o
$`TRUE`
$`TRUE`[[1]]
v1 v2 v3
1 1 20 a
2 2 21 b
3 3 22 c
4 4 23 d
5 5 24 e
6 6 25 f
7 7 26 g
8 8 27 h
9 9 28 i
10 10 29 j
11 11 30 k
12 12 31 l
13 13 32 m
14 14 33 n
15 15 34 o
$`TRUE`[[2]]
v1 v2 v3
1 1 20 a
2 2 21 b
3 3 22 c
4 4 23 d
5 5 24 e
6 6 25 f
7 7 26 g
8 8 27 h
9 9 28 i
10 10 29 j
11 11 30 k
12 12 31 l
13 13 32 m
14 14 33 n
15 15 34 o
Upvotes: 1
Reputation: 76402
If I understand what you want correctly, the following code separates the original list into two lists: listA has all data frames with names equal to the names of mylist[[1]]; listB has all other data frames. It uses *apply functions instead of explicit for loops.
nms <- lapply(mylist, names)
inx <- sapply(nms[-1], function(nm) all(nm == nms[[1]]))
inx <- c(TRUE, inx)
listA <- mylist[inx]
listB <- mylist[!inx]
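One caveat: in all(nm == nms[[1]]), == recycles the shorter vector, so data frames with a different number of columns can give warnings or misleading results. Since the question also mentions column order and variable types, a finer classification might look like the sketch below. The helper check_structure and its labels are hypothetical, the reference is assumed to be the first data frame, and the four small data frames are made up for the example.

```r
# Hypothetical helper: describe how `df` differs from the reference `ref`
check_structure <- function(df, ref) {
  if (ncol(df) != ncol(ref) || !setequal(names(df), names(ref))) {
    return("different names")
  }
  if (!identical(names(df), names(ref))) {
    return("different order")
  }
  # Same names in the same order: compare the class of each column
  if (!identical(lapply(df, class), lapply(ref, class))) {
    return("different types")
  }
  "ok"
}

df1 <- data.frame(v1 = 1:3, v2 = 4:6, v3 = letters[1:3])
df2 <- df1[c("v3", "v1", "v2")]                               # reordered columns
df3 <- data.frame(v1 = 1:3, v2 = 4:6, v3b = letters[1:3])     # renamed column
df4 <- data.frame(v1 = c(1.5, 2, 3), v2 = 4:6, v3 = letters[1:3])  # numeric v1
mylist <- list(df1, df2, df3, df4)

status <- vapply(mylist, check_structure, character(1), ref = mylist[[1]])

# Data frames that only differ in column order can be fixed by reindexing
fixable <- status == "different order"
mylist[fixable] <- lapply(mylist[fixable], function(df) df[names(mylist[[1]])])
```

The status vector can then be passed to split(mylist, status) to get one sub-list per kind of problem.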
Upvotes: 0