user3545679

Reputation: 181

Merge dataframes in R, using shared columns and differing rows

I tried using the merge function here, but I am stumped. I apologize, because this seems basic, but the by.x and by.y arguments are quite confusing to me. I would like to keep only the columns shared between dataframe A and dataframe B, and then combine the two dataframes into one. The dataframes do not share any Taxa (the first column), but they do share a portion of the columns X1 - X10000, etc. Each dataframe has ~8,000 columns and a few hundred rows. In this example, X2 and X5 are shared, while X1 and X3 are not. From intersecting the column name vectors, I know that the dataframes share ~3000 columns.

Dataframe A:

 Taxa   X1      X2      X5
 118    T       N       A
 113    N       N       A
 60     C       Y       G
 121    N       N       N

Dataframe B:

 Taxa  X2      X3      X5
 200   C       G       N
 119   T       N       G
 30    C       G       G
 21    C       N       N

Desired merged dataframe:

 Taxa    X2      X5
 118     N       A
 113     N       A
 60      Y       G
 121     N       N
 200     C       N
 119     T       G
 30      C       G
 21      C       N

When I try using the merge function, in a variety of ways, I get this (with my actual column numbers here):

      Taxa      X408050  X995019   
NA    <NA>     <NA>     <NA>       
NA.1  <NA>     <NA>     <NA>     
NA.2  <NA>     <NA>     <NA>       
NA.3  <NA>     <NA>     <NA>      
NA.4  <NA>     <NA>     <NA>     
NA.5  <NA>     <NA>     <NA>      
NA.6  <NA>     <NA>     <NA>      

Upvotes: 2

Views: 1160

Answers (1)

jazzurro

Reputation: 23574

Taking PierreLafortune's advice, I will leave my suggestion as an answer. Since you said you have about 8000 columns in each data frame, you first need to find which column names the two have in common. To do that, you can use intersect(). Once you have the shared column names, subset each data frame to those columns. Then you can combine the two data frames with rbind().

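# Column names present in both data frames
ind <- intersect(names(mydf), names(mydf2))

# Keep only the shared columns in each data frame, then stack the rows
rbind(mydf[, ind], mydf2[, ind])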

#  Taxa X2 X5
#1  118  N  A
#2  113  N  A
#3   60  Y  G
#4  121  N  N
#5  200  C  N
#6  119  T  G
#7   30  C  G
#8   21  C  N
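As a side note on why merge() is a poor fit here: merge() is a join, so it pairs rows by matching key values rather than stacking them. The sketch below rebuilds the example data from the question with data.frame() and contrasts the two calls. The names joined, shared and stacked are only illustrative, and drop = FALSE is an added precaution (not part of the code above) so that the subset stays a data frame even if only one column is shared.

```r
# Example data copied from the question (a stand-in for the real, much wider frames)
mydf <- data.frame(Taxa = c(118L, 113L, 60L, 121L),
                   X1 = c("T", "N", "C", "N"),
                   X2 = c("N", "N", "Y", "N"),
                   X5 = c("A", "A", "G", "N"),
                   stringsAsFactors = FALSE)
mydf2 <- data.frame(Taxa = c(200L, 119L, 30L, 21L),
                    X2 = c("C", "T", "C", "C"),
                    X3 = c("G", "N", "G", "N"),
                    X5 = c("N", "G", "G", "N"),
                    stringsAsFactors = FALSE)

# merge() with default arguments joins on all shared column names
# (Taxa, X2, X5) and keeps only rows whose values match in both frames.
joined <- merge(mydf, mydf2)
nrow(joined)
# [1] 0

# Stacking instead: keep the shared columns, then bind the rows.
shared <- intersect(names(mydf), names(mydf2))
stacked <- rbind(mydf[, shared, drop = FALSE],
                 mydf2[, shared, drop = FALSE])
dim(stacked)
# [1] 8 3
```

Because no row of the first data frame has the same Taxa, X2 and X5 values as a row of the second, the default join returns zero rows, while rbind() on the shared columns gives the eight rows shown above.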

DATA

mydf <- structure(list(Taxa = c(118L, 113L, 60L, 121L), X1 = c("T", "N", 
"C", "N"), X2 = c("N", "N", "Y", "N"), X5 = c("A", "A", "G", 
"N")), .Names = c("Taxa", "X1", "X2", "X5"), class = "data.frame", row.names = c(NA, 
-4L))

mydf2 <- structure(list(Taxa = c(200L, 119L, 30L, 21L), X2 = c("C", "T", 
"C", "C"), X3 = c("G", "N", "G", "N"), X5 = c("N", "G", "G", 
"N")), .Names = c("Taxa", "X2", "X3", "X5"), class = "data.frame", row.names = c(NA, 
-4L))

Upvotes: 6
