RnD
RnD

Reputation: 1182

R- How to merge multiple dataframes of different lengths?

I have been stuck with this issue for a while now. Need some help. I am reading the following files (which can be mire than 3 files files) into a dataframe. My input files look like the following: file1:

someName    someMOD someID
A   T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P),S762(P) 1
B   S495(P) 2
C   S162(P),Q159(D) 3
D   S45(P),C47(C),S48(P),S26(P) 4
E   S18(P)  5

file2:

someName    someMOD someID
C   S162(P),Q159(D) 3
D   S45(P),C47(C),S48(P),S26(P) 4
F   S182(P) 6
E   S18(P)  5
Z   Q100(P) 9
A   T754(P),M691(O),S694(P),S739(P),S740(P) 1

file3:

someName    someMOD someID
A   T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P) 1
B   S495(P) 2
D   S45(P),C47(C),S48(P),S26(P) 4
E   S18(P)  5
F   S182(P) 6
L   Z182(P) 8
C   S162(P),Q159(D) 3

My Code:

  fileList <- dir(pattern="*.xls")
  i<-1
  j<-1
  a<-list()
  mybigtable<-data.frame

  for (f in 1:length(fileList)){
    fileName <- fileList[f]
    X <-read.xls(fileName)
    if(regexpr("Drug_Rep", fileName)[1]>0){
      a[[i]]<-X
     }
      i=i+1
    }
    else{
        #Don't do anything
    }
  }

  #Now i want to merge my dataframes
  mymerge <- function(x, y)
    merge(x, y, by=c("someName", "someID"), all=TRUE))

  Reduce(mymerge,a) #passing my list of dataframes 'a'

I did dput() on my 'a' list:

list(structure(list(someName = structure(c(1L, 2L, 4L, 5L, 6L, 
7L, 3L), .Label = c("A", "B", "C", "D", "E", "F", "L"), class = "factor"), 
    someMOD = structure(c(6L, 5L, 4L, 2L, 3L, 7L, 1L), .Label = c("S162(P),Q159(D)", 
    "S18(P)", "S182(P)", "S45(P),C47(C),S48(P),S26(P)", "S495(P)", 
    "T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P)", 
    "Z182(P)"), class = "factor"), someID = c(1L, 2L, 4L, 5L, 
    6L, 8L, 3L)), .Names = c("someName", "someMOD", "someID"), class = "data.frame", row.names = c(NA, 
-7L)), structure(list(someName = structure(1:5, .Label = c("A", 
"B", "C", "D", "E"), class = "factor"), someMOD = structure(c(5L, 
4L, 1L, 3L, 2L), .Label = c("S162(P),Q159(D)", "S18(P)", "S45(P),C47(C),S48(P),S26(P)", 
"S495(P)", "T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P),S762(P)"
), class = "factor"), someID = 1:5), .Names = c("someName", "someMOD", 
"someID"), class = "data.frame", row.names = c(NA, -5L)), structure(list(
    someName = structure(c(2L, 3L, 5L, 4L, 6L, 1L), .Label = c("A", 
    "C", "D", "E", "F", "Z"), class = "factor"), someMOD = structure(c(2L, 
    5L, 4L, 3L, 1L, 6L), .Label = c("Q100(P)", "S162(P),Q159(D)", 
    "S18(P)", "S182(P)", "S45(P),C47(C),S48(P),S26(P)", "T754(P),M691(O),S694(P),S739(P),S740(P)"
    ), class = "factor"), someID = c(3L, 4L, 6L, 5L, 9L, 1L)), .Names = c("someName", 
"someMOD", "someID"), class = "data.frame", row.names = c(NA, 
-6L)))

What is my mistake in populating a list? Any help is really appreciated. I am just trying to get an out put like the following:

enter image description here

Upvotes: 0

Views: 535

Answers (2)

Hong Ooi
Hong Ooi

Reputation: 57686

The problem with the code I gave you before is that merge gets confused if there are any duplicate column names, and you're merging more than 3 datasets. You'll have to rename your someMOD columns so they don't clash. A for loop works as well as anything for this purpose.

dupvars <- which(!names(a[[1]]) %in% c("someName", "someID"))
for(i in seq_along(a))
    names(a[[i]])[dupvars] <- paste0(names(a[[i]])[dupvars], i)

# and then merge
Reduce(mymerge, a)

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Perhaps the problem is that you're actually not trying to merge in the standard sense, but reshape. In this case, you can rbind all the data.frames together after adding a "time" variable, and use dcast from "reshape2" to get what you're after:

Add a "time" variable and rbind the data.frames together

temp <- do.call(rbind, 
                lapply(seq_along(a), 
                       function(x) data.frame(a[[x]], time = x)))
head(temp)
#   someName                                                 someMOD someID time
# 1        A T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P)      1    1
# 2        B                                                 S495(P)      2    1
# 3        D                             S45(P),C47(C),S48(P),S26(P)      4    1
# 4        E                                                  S18(P)      5    1
# 5        F                                                 S182(P)      6    1
# 6        L                                                 Z182(P)      8    1

Transform the data.frame from a "long" format to a "wide" format

library(reshape2)
dcast(temp, someName + someID ~ time, value.var="someMOD")
#   someName someID                                                       1
# 1        A      1 T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P)
# 2        B      2                                                 S495(P)
# 3        C      3                                         S162(P),Q159(D)
# 4        D      4                             S45(P),C47(C),S48(P),S26(P)
# 5        E      5                                                  S18(P)
# 6        F      6                                                 S182(P)
# 7        L      8                                                 Z182(P)
# 8        Z      9                                                    <NA>
#                                                                 2
# 1 T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P),S762(P)
# 2                                                         S495(P)
# 3                                                 S162(P),Q159(D)
# 4                                     S45(P),C47(C),S48(P),S26(P)
# 5                                                          S18(P)
# 6                                                            <NA>
# 7                                                            <NA>
# 8                                                            <NA>
#                                         3
# 1 T754(P),M691(O),S694(P),S739(P),S740(P)
# 2                                    <NA>
# 3                         S162(P),Q159(D)
# 4             S45(P),C47(C),S48(P),S26(P)
# 5                                  S18(P)
# 6                                 S182(P)
# 7                                    <NA>
# 8                                 Q100(P)

Upvotes: 1

Related Questions