Reputation: 1446
I have two data.frames (df1, df2), and I would like to replace the letters in columns P1-P10 of df2 with the corresponding values of df1$V2,
while keeping the first two columns of df2.
df1 = data.frame(V1=LETTERS, V2=rnorm(26))
df2 <- data.frame(Name=sample(LETTERS, 6), bd=sample(1:6), P1=sample(LETTERS,6), P2=sample(LETTERS, 6), P3=sample(LETTERS, 6), P4=sample(LETTERS, 6), P5=sample(LETTERS, 6), P6=sample(LETTERS, 6), P7=sample(LETTERS, 6), P8=sample(LETTERS, 6), P9=sample(LETTERS, 6), P10=sample(LETTERS, 6))
My approach is the following:
df3 <- matrix(setNames(df1[,2], df1[,1])[as.character(unlist(df2[,3:12]))], nrow=6, ncol=10)
df4 <- data.frame(cbind(df2[,1:2], df3))
This gives me my desired output. My real data has 10,000 columns; is there any way to avoid the cbind
step or make the process faster?
> df4
Name bd X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 V 6 -1.8991102 0.40269050 -0.1517500 -2.5297829 1.5315622 1.4897071 1.364071 -1.2443708 -1.3197276 -0.4917057
2 T 1 -2.5297829 -0.44614123 -0.1894970 -0.6693774 -0.1517500 -1.0650962 -0.151750 -0.4461412 -0.6693774 -1.1351770
3 R 5 -0.6693774 0.09059365 -2.5297829 0.3233827 -0.9383348 -0.4461412 1.281797 1.5315622 1.4897071 -0.4461412
4 B 4 -0.4461412 -0.93833476 -1.2443708 -0.4461412 -0.1894970 -0.9383348 -1.135177 -1.8991102 -0.1894970 0.4026905
5 K 2 -1.0180271 -1.06509624 -0.1939600 -0.1894970 1.4897071 -0.6693774 -1.899110 -1.3197276 1.5315622 -0.1517500
6 Y 3 1.5315622 -0.19396005 -0.4917057 -0.4664239 -1.8991102 0.4026905 -1.065096 -0.9383348 -1.2443708 -0.4664239
Thanks
Upvotes: 3
Views: 2810
Reputation: 10619
Try some *pply magic:
lookup<-tapply(df1$V2, df1$V1, unique) #Creates a lookup table
lookup.function<-function(x) as.numeric(lookup[as.character(x)]) #The function
df4<-data.frame(df2[,1:2], apply(df2[,3:12], 2,lookup.function )) #Builds the output
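As a possible variation on the lookup idea above, here is a minimal sketch on toy data. It assumes a plain named vector built with setNames() as the lookup table and uses lapply() column by column, which avoids the matrix coercion that apply() performs. The toy values and column names are made up for illustration and are not from the question.

```r
# Toy data with fixed values so the result is easy to check by eye (assumed, not from the question)
df1 <- data.frame(V1 = c("A", "B", "C"), V2 = c(1.5, -2, 0.25))
df2 <- data.frame(Name = c("x", "y"), bd = 1:2,
                  P1 = c("A", "C"), P2 = c("B", "B"))

# Named vector: names are the letters, values are the numbers
lookup <- setNames(df1$V2, as.character(df1$V1))

# Translate each P column in turn; assigning into df4[3:4] keeps df4 a data frame
df4 <- df2
df4[3:4] <- lapply(df2[3:4], function(col) unname(lookup[as.character(col)]))

df4
```

With 10,000 columns the same pattern would apply with `3:ncol(df2)` in place of `3:4`; whether it is faster than apply() on your data is something to measure rather than assume.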
Update:
The *pply family is much faster than merge (here, the match-based replacement), by up to an order of magnitude or more in my tests. Check this out:
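num <- 1000
df1 <- data.frame(V1 = LETTERS, V2 = rnorm(26))
# cbind() with a character matrix makes every column character, including first and second
df2 <- data.frame(cbind(first = 1:num, second = 1:num, matrix(sample(LETTERS, num^2, replace = TRUE), nrow = num, ncol = num)))
# Timing 1: tapply lookup table + apply
start <- Sys.time()
lookup <- tapply(df1$V2, df1$V1, unique)
lookup.function <- function(x) as.numeric(lookup[as.character(x)])
df4 <- data.frame(cbind(df2[, 1:2], data.frame(apply(df2[, 3:(num + 2)], 2, lookup.function))))
(difftime(Sys.time(), start))
# Timing 2: the match()-based replacement (called "merge" in the text)
start <- Sys.time()
# Columns 3:(num + 2) cover all letter columns; 3:num would leave the last two untouched
df4.merge <- "[<-"(df2, 3:(num + 2), value = df1[match(as.character(unlist(df2[3:(num + 2)])), as.character(df1[[1]])), 2])
(difftime(Sys.time(), start))
# Both results should agree in every cell
all(df4 == df4.merge)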
For 3000 columns and rows, the *pply combination needs 4.3 s, whereas merge needs about 22 s on my slow Intel. And it scales nicely: for 4000 columns and rows, the respective times are 7.4 s and 118 s.
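If you want to reproduce a comparison like this yourself, the sketch below is one way to do it with system.time() instead of Sys.time() differences. The seed, the small size (num = 200), the named-vector lookup and the names res_apply, res_match, t_apply and t_match are all assumptions chosen for illustration; timings will differ by machine and R version, so none are shown here.

```r
set.seed(1)                      # assumption: fixed seed for reproducibility
num <- 200                       # assumption: small size so this runs quickly
df1 <- data.frame(V1 = LETTERS, V2 = rnorm(26))
letters_mat <- matrix(sample(LETTERS, num^2, replace = TRUE), nrow = num)
df2 <- data.frame(first = 1:num, second = 1:num, letters_mat)
cols <- 3:(num + 2)              # all letter columns

# Named-vector lookup table (letters -> numbers)
lookup <- setNames(df1$V2, as.character(df1$V1))

# Approach 1: apply() over columns with the lookup table
t_apply <- system.time(
  res_apply <- data.frame(
    df2[1:2],
    apply(df2[cols], 2, function(x) unname(lookup[as.character(x)]))
  )
)

# Approach 2: one match() call over all letter cells at once
t_match <- system.time({
  res_match <- df2
  res_match[cols] <- df1[match(as.character(unlist(df2[cols])),
                               as.character(df1$V1)), 2]
})

# Elapsed seconds for each approach
c(apply = t_apply[["elapsed"]], match = t_match[["elapsed"]])

# Sanity check: both approaches produce the same numbers
identical(unlist(res_apply[cols], use.names = FALSE),
          unlist(res_match[cols], use.names = FALSE))
```

Raising num toward the sizes quoted above should show how each approach scales on your own hardware.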
Upvotes: 0
Reputation: 81683
You can match the values of df2[3:12] in df1[[1]]. These row numbers are used to extract the values from df1[2].
df2[3:12] <- df1[match(as.character(unlist(df2[3:12])),
as.character(df1[[1]])), 2]
The result (df2):
Name bd P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
1 H 5 0.1199355 0.3752010 -0.3926061 -1.1039548 -0.1107821 0.9867373 -0.3360094 -0.7488000 -0.3926061 2.0667704
2 U 4 0.1168599 0.1168599 0.9867373 1.3521418 0.9867373 -0.3360094 -0.7724007 -0.3926061 -0.3360094 -1.2543480
3 R 3 -1.2337890 -0.1107821 -0.7724007 2.0667704 0.3752010 0.4645504 0.9867373 0.1168599 -0.0981773 -0.3926061
4 G 2 -0.3926061 0.3199261 -0.0981773 -0.1107821 2.0667704 -1.1039548 -1.2337890 0.3199261 -1.2337890 -2.1534678
5 C 6 -2.1534678 -1.1039548 -1.1039548 -0.7488000 0.4645504 0.3199261 -2.1534678 -0.3360094 0.9867373 0.8771467
6 I 1 0.6171634 0.6224091 1.8011711 0.7292998 0.8771467 2.0667704 0.3752010 0.4645504 -2.1534678 -0.7724007
If you don't want to replace the values inside df2, you can create a new data frame df4 with
df4 <- "[<-"(df2, 3:12, value = df1[match(as.character(unlist(df2[3:12])),
as.character(df1[[1]])), 2])
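To make the two steps above concrete, here is a small worked sketch on made-up vectors (keys, values, x and the tiny data frame df are illustrative names, not part of the question). It shows what match() returns and how the functional form of `[<-` returns a modified copy while leaving the original untouched.

```r
keys   <- c("A", "B", "C")       # plays the role of df1[[1]]
values <- c(10, 20, 30)          # plays the role of df1[[2]]
x      <- c("C", "A", "C", "B")  # letters to translate

idx <- match(x, keys)            # positions of x in keys: 3 1 3 2
values[idx]                      # 30 10 30 20

# Functional form of the replacement: returns a new data frame, df is unchanged
df <- data.frame(id = 1:2, P1 = c("C", "A"))
df_new <- `[<-`(df, 2, value = values[match(as.character(df$P1), keys)])
df_new$P1                        # 30 10
```

A letter that does not occur in keys makes match() return NA, so the corresponding cell becomes NA rather than raising an error.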
Upvotes: 3