slhck
slhck

Reputation: 38760

How can I merge multiple dataframes with the same column names?

What I have:

I have a "master" dataframe that has the following columns:

userid, condition

Since there are four experiment conditions, I also have four dataframes that carry answer information, with the following columns:

userid, condition, answer1, answer2

Now, I'd like to join these, so all combinations of user IDs, conditions and their answers to these conditions are merged. Each condition should only have the correct answer in the appropriate column, per row.


Short, self-contained example:

master = data.frame(userid=c("foo","foo","foo","foo","bar","bar","bar","bar"), condition=c("A","B","C","D","A","B","C","D"))
cond_a = data.frame(userid=c("foo","bar"), condition="A", answer1=c("1","1"), answer2=c("2","2"))
cond_b = data.frame(userid=c("foo","bar"), condition="B", answer1=c("3","3"), answer2=c("4","4"))
cond_c = data.frame(userid=c("foo","bar"), condition="C", answer1=c("5","5"), answer2=c("6","6"))
cond_d = data.frame(userid=c("foo","bar"), condition="D", answer1=c("7","7"), answer2=c("8","8"))

How do I merge all conditions into the master, so the master table looks like follows?

  userid condition answer1 answer2
1    bar         A       1       2
2    bar         B       3       4
3    bar         C       5       6
4    bar         D       7       8
5    foo         A       1       2
6    foo         B       3       4
7    foo         C       5       6
8    foo         D       7       8

I've tried the following:

temp = merge(master, cond_a, all.x=TRUE)

Which gives me:

  userid condition answer1 answer2
1    bar         A       1       2
2    bar         B    <NA>    <NA>
3    bar         C    <NA>    <NA>
4    bar         D    <NA>    <NA>
5    foo         A       1       2
6    foo         B    <NA>    <NA>
7    foo         C    <NA>    <NA>
8    foo         D    <NA>    <NA>

But as soon as I do this…

merge(temp, cond_b, all.x=TRUE)

There are no values for condition B. How come?

  userid condition answer1 answer2
1    bar         A       1       2
2    bar         B    <NA>    <NA>
3    bar         C    <NA>    <NA>
4    bar         D    <NA>    <NA>
5    foo         A       1       2
6    foo         B    <NA>    <NA>
7    foo         C    <NA>    <NA>
8    foo         D    <NA>    <NA>

Upvotes: 9

Views: 28780

Answers (3)

Clayton Stanley
Clayton Stanley

Reputation: 7784

You can express this join as a SQL statement, and then use the sqldf library to execute it.

cond_all = rbind(cond_a, cond_b, cond_c, cond_d)

> sqldf('select p.userid as userid, p.condition as condition, answer1, answer2 from master as p join cond_all as q on p.userid=q.userid and p.condition=q.condition order by userid, condition')
  userid condition answer1 answer2
1    bar         A       1       2
2    bar         B       3       4
3    bar         C       5       6
4    bar         D       7       8
5    foo         A       1       2
6    foo         B       3       4
7    foo         C       5       6
8    foo         D       7       8

You mentioned in a comment that the master dataframe has extra columns that do not exist in the cond dataframes. You should be able to modify this SQL query to still work for this case.

Upvotes: 1

A_K
A_K

Reputation: 2651

As stated by the OP, given that no explicit relationship with the master data frame, an option is this:

temp <-rbind(cond_a,cond_b,cond_c,cond_d)
temp[order(temp["userid"]),]

Perhaps if any relationship was known, there could be a non-simplistic solution.

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193677

You can use Reduce() and complete.cases() as follows:

merged <- Reduce(function(x, y) merge(x, y, all=TRUE), 
                 list(master, cond_a, cond_b, cond_c, cond_d))
merged[complete.cases(merged), ]
#    userid condition answer1 answer2
# 1     bar         A       1       2
# 2     bar         B       3       4
# 4     bar         C       5       6
# 6     bar         D       7       8
# 8     foo         A       1       2
# 9     foo         B       3       4
# 11    foo         C       5       6
# 13    foo         D       7       8

Reduce() might take some getting accustomed to. You define your function, and then provide a list of objects to repeatedly apply the function to. Thus, that statement is like doing:

temp1 <- merge(master, cond_a, all=TRUE)
temp2 <- merge(temp1, cond_b, all=TRUE)
temp3 <- merge(temp2, ....)

Or something like:

merge(merge(merge(master, cond_a, all=TRUE), cond_b, all=TRUE), cond_c, all=TRUE)

complete.cases() creates a logical vector of whether the specified columns are "complete" or not; this logical vector can be used to subset from the merged data.frame.

Upvotes: 13

Related Questions