Reputation: 259
I'm new to R programming and just started to learn it, and i need you to help me with this issue please.
I have 2 data frames :
the first(df1):
V1 V2
A A
A B
A C
B A
B B
B C
etc
the second(df2) :
V1 Va Vb
A 12 23
B 15 53
C 321 543
D 54 325
etc..
Use this code to generate the sample data.
df1 <- data.frame(
V1 = rep(LETTERS[1:2], each = 3L),
V2 = rep.int(LETTERS[1:3], 2L)
)
dfr2 <- data.frame(
Va = c(12, 15, 312, 54),
Vb = c(23, 53, 543, 325)
)
I need to take the Va and Vb from df2 and place them in df1 based on V1 and V2 of df1.
so I want this output:
df3:
V1 V2 Va1 Vb1 Va2 Vb2
A A 12 23 12 23
A B 12 23 15 23
A C 12 23 321 543
B A 15 23 12 23
B B 15 23 15 23
B C 15 23 321 543
hope that this can be done in R without a dozenz of for loops :S.
Upvotes: 1
Views: 3534
Reputation: 664
I think Hadley Wickham's dplyr package has some great tools for combining dataframes. It's a different way to arrive at the same place.
df1 <- data.frame(
V1 = rep(LETTERS[1:2], each = 3L),
V2 = rep.int(LETTERS[1:3], 2L)
)
dfr2 <- data.frame(
V1 = LETTERS[1:4],
Va = c(12, 15, 312, 54),
Vb = c(23, 53, 543, 325)
)
##necessary libraries
library(magrittr, dplyr)
Now, using the left_join()
and rename()
functions from the dplyr
package, and the pipe operator from the magrittr
package, I think you can save a lot of keystrokes.
df3 <- df1 %>% #pipe operator (%>%) is from magrittr package
dplyr::left_join(dfr2, c('V1'="V1")) %>% #merge once
dplyr::rename(Va1 = Va, Vb1 = Vb) %>% #rename columns
dplyr::left_join(dfr2, c('V2'="V1")) %>% #merge on different column
dplyr::rename(Va2 = Va, Vb2 = Vb) #rename again
This can also be done with one merge instead of two, but I like to see what's going on in each step.
> print(df3)
V1 V2 Va1 Vb1 Va2 Vb2
1 A A 12 23 12 23
2 A B 12 23 15 53
3 A C 12 23 312 543
4 B A 15 53 12 23
5 B B 15 53 15 53
6 B C 15 53 312 543
Upvotes: 1
Reputation: 108543
You can use the power of match() for this, provided the names in df1$V1 are unique :
#some data
df1 <- data.frame(
V1 = rep(c("A","B"),each=3),
V2 = rep(LETTERS[1:3],2)
)
df2 <- data.frame(V1=LETTERS[1:3],Va=1:3,Vb=3:1)
out <- cbind(df1,
df2[match(df1$V1,df2$V1),-1],
df2[match(df1$V2,df2$V1),-1]
)
names(out)[3:6] <- c("Va1","Vb1","Va2","Vb2")
rownames(out) <- 1:nrow(out)
Gives
> out
V1 V2 Va1 Vb1 Va2 Vb2
1 A A 1 3 1 3
2 A B 1 3 2 2
3 A C 1 3 3 1
4 B A 2 2 1 3
5 B B 2 2 2 2
6 B C 2 2 3 1
You'll have to rename the columns manually, as you would get multiple columns with the same name. Although it's technically possible in a dataframe, it can cause trouble later on. You can automatize this using something along the lines of:
names(out) <-
c("V1","V2",
sapply(names(df2)[2:3],paste,1:2,sep="")
)
EDIT : for big dataframes, conversion to matrices makes another huge difference. One has to pay attention to the intrinsic changes in type for the different variables. The speedup is due to the fact that cbind and merge take a whole lot of time figuring out the correct types for each variable.
With following data and functions :
n <- 1e5
df1 <- data.frame(V1 = rep(LETTERS,each=n),V2 = rep(LETTERS,n),
stringsAsFactors=FALSE)
df2 <- data.frame(V1=LETTERS,Va=1:26,Vb=26:1,stringsAsFactors=FALSE)
fast_JM <- function(df1,df2){
out <- cbind(
as.matrix(df2[,-1])[match(df1$V1,df2$V1),],
as.matrix(df2[,-1])[match(df1$V2,df2$V1),]
)
out <- as.data.frame(out)
names(out) <- sapply(names(df2)[2:3],paste,1:2,sep="")
out$V1 <- df1$V1
out$V2 <- df1$V2
out
}
slow_JM <- function(df1,df2){
out <- cbind(df1,
df2[match(df1$V1,df2$V1),-1],
df2[match(df1$V2,df2$V1),-1]
)
names(out)[3:6] <- c("Va1","Vb1","Va2","Vb2")
out
}
double_merge <- function(df1,df2){
merge(merge(df1, df2), df2, by.x = "V2", by.y = "V1", suffixes = c("1", "2"))
}
the benchmarking becomes :
require(rbenchmark)
benchmark(fast_JM(df1,df2),slow_JM(df1,df2),double_merge(df1,df2),
replications=1,columns=c("test","elapsed","relative"),order="relative")
test elapsed relative
1 fast_JM(df1, df2) 0.89 1.00000
2 slow_JM(df1, df2) 12.54 14.08989
3 double_merge(df1, df2) 42.50 47.75281
So a speedup of more than 40 times compared to the double merge, and more than 10 times compared to using dataframes.
Upvotes: 1
Reputation: 69171
You can use merge()
twice to get what you want. By default, merge
looks for common column names to join. In the second merge, we'll specify the column we want it to merge on:
df1 <- data.frame(V1 = c('A', 'A', 'A', 'B', 'B', 'B'), V2 = c('A', 'B', 'C', 'A', 'B', 'C'))
df2 <- data.frame(V1 = c('A', 'B', 'C', 'D'), Va = c(12, 15, 321, 54), Vb = c(23, 53, 543, 325))
merge(merge(df1, df2), df2, by.x = "V2", by.y = "V1", suffixes = c("1", "2"))
Upvotes: 4