Arun
Arun

Reputation: 625

Update values in dataframe from a reference data frame, based on matching of a second variable

I have two data frames in R which look like these.

> df
    Var1 Freq
1   0.01    1
2      1   27
3    100   27
4   1000  488
5   2000    4
6   5000    7
7  50000    7
8 100000    1


> return_matrix
         Var1 Freq
1     0.00001    0
2        0.01    0
3           1    0
4           8    0
5         100    0
6         200    0
7         500    0
8        1000    0
9        2000    0
10       5000    0
11      10000    0
12      50000    0
13     100000    0
14     200000    0
15     500000    0
16    1000000    0
17   10000000    0
18  100000000    0
19 1000000000    0

I wanted to do like a Vlookup kind of stuff of the above data frames so that the output will look like below:

  > combined
             Var1 Freq
    1     0.00001    0
    2        0.01    1
    3           1   27
    4           8    0
    5         100   27
    6         200    0
    7         500    0
    8        1000  488
    9        2000    4
    10       5000    7
    11      10000    0
    12      50000    7
    13     100000    1
    14     200000    0
    15     500000    0
    16    1000000    0
    17   10000000    0
    18  100000000    0

I tried to perform a merge in R using the below mentioned R code. However i am not getting the expected results like above. Could you please help me on this?

> combined_matrix <- merge(return_matrix, df, by = "Var1" )
> combined_matrix
    Var1 Freq.x Freq.y
1   0.01      0      1
2      1      0     27
3    100      0     27
4   1000      0    488
5 100000      0      1
6   2000      0      4
7   5000      0      7
8  50000      0      7

Upvotes: 1

Views: 599

Answers (3)

Joris Meys
Joris Meys

Reputation: 108603

There's no need to merge anything here.

id <- match(return_matrix$Var1, df$Var1, nomatch = 0L)
return_matrix$Freq[id != 0] <- df$Freq[id]

does the trick.

This code will outperform any merge or plyr solution by manifold. It's about 100 times faster than ddply and about 10 times faster than merge in the following simulation on my machine:

library(plyr)
library(rbenchmark)
df <- data.frame(
  Var1 = c(0.01, 1, 100, 1000, 2000, 5000, 50000,100000),
  Freq = c(1,27,27,488,4,7,7,1)
)

return_matrix <- data.frame(
  Var1 = c(0.00001,0.01,1,8,100,200,500,1000,2000,5000,
           10000,50000,100000,200000,500000,1e6,1e7,1e8),
  Freq = 0
)



codeJM <- function(df, return_matrix){
  id <- match(return_matrix$Var1, df$Var1, nomatch = 0L)
  return_matrix$Freq[id != 0] <- df$Var1[id]
  return_matrix
}

codemerge <- function(df, return_matrix){
  combined_matrix <- merge(return_matrix, df, by = "Var1" , all = TRUE)
  combined_matrix$Freq <- combined_matrix$Freq.x+combined_matrix$Freq.y
  combined_matrix$Freq.x <- combined_matrix$Freq.y <- NULL
  combined_matrix
}

codeddply <- function(df, return_matrix){
  full <- rbind(df,return_matrix)
  combined <- ddply(full ,.(Var1),function(x) 
                      data.frame(Var1=x$Var1[1],Freq=sum(x$Freq)))
  combined
}

benchmark(
  codemerge(df, return_matrix),
  codeJM(df, return_matrix),
  codeddply(df, return_matrix),
  replications = 1000
)

gives :

                          test replications elapsed relative user.self 
3 codeddply(df, return_matrix)         1000    5.38    107.6      5.37        
2    codeJM(df, return_matrix)         1000    0.05      1.0      0.05        
1 codemerge(df, return_matrix)         1000    0.51     10.2      0.52

Upvotes: 2

fhlgood
fhlgood

Reputation: 479

You can still use merge by selecting all.x = TRUE, this keeps all rows:

c<-merge(return_matrix, df, by = "Var1", all.x = TRUE)

This will create a second freq column, but you can clean it fairly easily

c<-c[, -2]
c[,2][which(is.na(c[,2]))]<- 0

Upvotes: 1

Jonas Coussement
Jonas Coussement

Reputation: 402

You can either work with the match function Joris suggests or use the ddply function from the plyr package:

library(plyr)
full <- rbind(df,return_matrix)
combined <- ddply(full ,.(Var1),function(x) data.frame(Var1=x$Var1[1],Freq=sum(x$Freq)))

This will sum the values from Freq even if they are not 0 in return_matrix

Upvotes: 1

Related Questions