Reputation: 886
I have two data frames--one is huge (over 2 million rows) and one is smaller (around 300,000 rows). The smaller data frame is a subset of the larger one. The only difference is that the larger one has an additional attribute that I need to add to the smaller one.
Specifically, the attributes for the large data frame are (Date, Time, Address, Flag) and the attributes for the small data frame are (Date, Time, Address). I need to get the correct corresponding Flag value somehow into the smaller data frame for each row. The final size of the "merged" data frame should be the same as my smaller one, discarding the unused rows from the large data frame.
What is the best way to accomplish this?
Update: I tested the merge function with the following:
new<-merge(data12, data2, by.x = c("Date", "Time", "Address"),
by.y=c("Date", "Time", "Address"))
and
new<-merge(data12, data2, by = c("Date", "Time", "Address"))
both return an empty data frame (new) with the right number of attributes as well as the following warning message:
Warning message:In `[<-.factor`(`*tmp*`, ri, value = c(15640, 15843, 15843, 15161, : invalid factor level, NAs generated
Upvotes: 0
Views: 3629
Reputation: 7714
Perhaps plyr is a more intuitive package for this operation. What you need is a SQL inner join. I believe this approach is clearer than merge().
Here is a simple example of how you would use join() with data sets of your size.
library(plyr)
id = c(1:2000000)
rnormal <- rnorm(id)
rbinom <- rbinom(2000000, 5,0.5)
df1 <- data.frame(id, rnormal, rbinom)
df2 <- data.frame(id = id[1:300000], rnormal = rnormal[1:300000])
You would like to add rbinom to df2
joined.df <- join(df1, df2, type = "inner")
Here is the performance of join() vs merge()
system.time(joined.df <- join(df1, df2, type = "inner"))
Joining by: id, rnormal
user system elapsed
22.44 0.53 22.80
system.time(merged.df <- merge(df1, df2))
user system elapsed
26.212 0.605 30.201
Upvotes: 3
Reputation: 6535
R> df1 = data.frame(a = 1:5, b = rnorm(5))
R> df1
a b
1 1 -0.09852819
2 2 -0.47658118
3 3 -2.14825893
4 4 0.82216912
5 5 -0.36285430
R> df2 = data.frame(a = 1:10000, c = rpois(10000, 6))
R> head(df2)
a c
1 1 2
2 2 4
3 3 5
4 4 3
5 5 3
6 6 8
R> merge(df1, df2)
a b c
1 1 -0.09852819 2
2 2 -0.47658118 4
3 3 -2.14825893 5
4 4 0.82216912 3
5 5 -0.36285430 3
Upvotes: 5