whistler
whistler

Reputation: 886

Merge 2 data frames, discard unmatched rows

I have two data frames--one is huge (over 2 million rows) and one is smaller (around 300,000 rows). The smaller data frame is a subset of the larger one. The only difference is that the larger one has an additional attribute that I need to add to the smaller one.

Specifically, the attributes for the large data frame are (Date, Time, Address, Flag) and the attributes for the small data frame are (Date, Time, Address). I need to get the correct corresponding Flag value somehow into the smaller data frame for each row. The final size of the "merged" data frame should be the same as my smaller one, discarding the unused rows from the large data frame.

What is the best way to accomplish this?

Update: I tested the merge function with the following:

new<-merge(data12, data2, by.x = c("Date", "Time", "Address"), 
           by.y=c("Date", "Time", "Address"))

and

new<-merge(data12, data2, by = c("Date", "Time", "Address"))

both return an empty data frame (new) with the right number of attributes as well as the following warning message:

Warning message:In `[<-.factor`(`*tmp*`, ri, value = c(15640, 15843, 15843, 15161,  : invalid factor level, NAs generated

Upvotes: 0

Views: 3629

Answers (2)

marbel
marbel

Reputation: 7714

Perhaps plyr is a more intuitive package for this operation. What you need is a SQL inner join. I believe this approach is clearer than merge().

Here is a simple example of how you would use join() with data sets of your size.

library(plyr)

id = c(1:2000000)
rnormal <- rnorm(id)
rbinom <- rbinom(2000000, 5,0.5)


df1 <- data.frame(id, rnormal, rbinom)
df2 <- data.frame(id = id[1:300000], rnormal = rnormal[1:300000])

You would like to add rbinom to df2

joined.df <- join(df1, df2, type = "inner")

Here is the performance of join() vs merge()

system.time(joined.df <- join(df1, df2, type = "inner"))
Joining by: id, rnormal
   user  system elapsed 
  22.44    0.53   22.80 
system.time(merged.df <- merge(df1, df2))
   user  system elapsed 
 26.212   0.605  30.201 

Upvotes: 3

Jake Burkhead
Jake Burkhead

Reputation: 6535

    R> df1 = data.frame(a = 1:5, b = rnorm(5))
    R> df1
      a           b
    1 1 -0.09852819
    2 2 -0.47658118
    3 3 -2.14825893
    4 4  0.82216912
    5 5 -0.36285430
    R> df2 = data.frame(a = 1:10000, c = rpois(10000, 6))
    R> head(df2)
      a c
    1 1 2
    2 2 4
    3 3 5
    4 4 3
    5 5 3
    6 6 8
    R> merge(df1, df2)
      a           b c
    1 1 -0.09852819 2
    2 2 -0.47658118 4
    3 3 -2.14825893 5
    4 4  0.82216912 3
    5 5 -0.36285430 3

Upvotes: 5

Related Questions