rnorouzian
rnorouzian

Reputation: 7517

Extracting and cbinding similarly named variables in a data.frame in R

I have a cbind of 2 data.frames called DATA. Using BASE R, I was wondering how I could extract and then, cbind similarly named variables in DATA and store them as a list?

For the example below, I want all variable AAs, and separately all variable BBs in DATA be separately cbinded and stored as a list?

Note: names could be anything, and the number of variables could be any number. A function(al) solution is highly appreciated.

Note: suppose we have NO ACCESS to r, the only input is DATA.

r <- list(
 data.frame(Name = rep("Jacob", 6), 
           X = c(2,2,1,1,NA, NA), 
           Y = c(1,1,1,2,1,NA), 
           Z = rep(3, 6), 
         out = rep(1, 6)), 

 data.frame(Name = rep("Jon", 6), 
           X = c(1,NA,3,1,NA,NA), 
           Y = c(1,1,1,2,NA,NA), 
           Z = rep(2, 6), 
         out = rep(1, 6)), 

 data.frame(Name = rep("Jon", 6), 
            X = c(1,NA,3,1,NA,NA), 
            Y = c(1,1,1,2,2,NA), 
            Z = rep(2, 6), 
          out = rep(2, 6)), 

 data.frame(Name = rep("Jim", 6), 
            X = c(1,NA,3,1,NA,NA), 
            Y = c(1,1,1,2,2,NA), 
            Z = rep(2, 6), 
          out = rep(1, 6)))

DATA <- do.call(cbind, r)  ## DATA: cbind of two data.frames

Upvotes: 1

Views: 81

Answers (1)

akrun
akrun

Reputation: 887018

Here is an option with split. Wouldn't recommend to have same duplicate column names in the dataset. But, if it is really needed, after thee split, change the column names by removing the . following by one or more numbers at the end of it with sub

nm1 <- Reduce(intersect, lapply(r, colnames)) # get the common names
lst1 <- split.default(DATA[names(DATA) %in% nm1], names(DATA)[names(DATA) %in% nm1])
lapply(lst1, function(x) setNames(x, sub("\\.\\d+$", "", names(x))))

Or if we need to use only 'DATA' and not 'r' for finding the intersecting column names. It is difficult but we can get a frequency of the occurence of column names and select that have 2 as frequency

tbl <- table(names(DATA))
nm1 <- names(which(tbl==max(tbl)))

Use that in the split.default as before

lst1 <- split.default(DATA[names(DATA) %in% nm1], names(DATA)[names(DATA) %in% nm1])
lapply(lst1, function(x) setNames(x, sub("\\.\\d+$", "", names(x))))

Using OP's new example

r <- list( data.frame( AA = c(2,2,1,1,3,2), BB = c(1,1,1,2,2,NA), CC = 1:6), data.frame( AA = c(1,NA,3,1,3,2), BB = c(1,1,1,2,2,2)), data.frame( AA = c(1,NA,3,1,3,2), BB = c(1,1,1,2,2,2), DD = 0:5) )
DATA <- do.call(cbind, r)

tbl <- table(names(DATA))
nm1 <- names(which(tbl==max(tbl)))
lst1 <- split.default(DATA[names(DATA) %in% nm1], names(DATA)[names(DATA) %in% nm1])
lapply(lst1, function(x) setNames(x, sub("\\.\\d+$", "", names(x))))
#$AA
#  AA AA AA
#1  2  1  1
#2  2 NA NA
#3  1  3  3
#4  1  1  1
#5  3  3  3
#6  2  2  2

#$BB
#  BB BB BB
#1  1  1  1
#2  1  1  1
#3  1  1  1
#4  2  2  2
#5  2  2  2
#6 NA  2  2

Upvotes: 1

Related Questions