Reputation: 110
I am trying to merge two datasets in R based on two criteria: rows must have the same id and year. One dataset has about 10,000 rows and the other about 2,000. I think a nested, row-by-row search would make the computing time explode. The data is sorted by id and year. Is there a more efficient search algorithm than the naive comparison?
Upvotes: 0
Views: 451
Reputation: 3280
There are many solutions to this problem, e.g. merging, indexing, or looping (as you said).
However, the most elegant solution is the data.table package, which is very fast for managing data sets and can be considered an evolved version of data.frame.
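For reference, the merge route mentioned above can be sketched in base R without any extra package. This is only a minimal illustration: the data frames df1 and df2 and their values are made up, and the column names id and Year are assumed to match in both tables.

```r
# Small illustrative data frames (hypothetical values, not the asker's data)
df1 <- data.frame(id = c(1, 1, 2, 3),
                  Year = c(2000, 2001, 2000, 2002),
                  v1 = c(0.1, 0.2, 0.3, 0.4))
df2 <- data.frame(id = c(1, 2, 4),
                  Year = c(2001, 2000, 2003),
                  v2 = c(10, 20, 30))

# Inner join: keep only rows whose id AND Year appear in both data frames
merged <- merge(df1, df2, by = c("id", "Year"))
print(merged)
```

By default merge() keeps only matching rows; all.x = TRUE or all.y = TRUE would keep unmatched rows from one side.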
Let us first set up the data. Based on the limited information provided in the question, here is some dummy data to illustrate the solution.
install.packages("data.table") # Only needed once, if the package is not installed yet
library(data.table)
set.seed(100)
dt1 <- data.table(
id = 1:10000,
Year = sample(1950:2014,size=10000,replace = TRUE),
v1 = runif(10000)
)
head(dt1)
dt2 <- data.table(
id = sample(1:10000,2000),
Year = sample(1950:2014,size=2000,replace = TRUE),
v2 = runif(2000),
v3 = runif(2000)
)
head(dt2)
Once the data is set up, the remaining part is very simple.
Step 1: Set the keys.
setkey(dt1,id,Year) # Set keys in first table
setkey(dt2,id,Year) # Set keys in second table
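Step 2: Merge whichever way you want. With nomatch=0, rows without a match are dropped, so both forms below are inner joins; they differ mainly in column order.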
dt1[dt2,nomatch=0]
dt2[dt1,nomatch=0]
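The keyed join above can also be written without calling setkey first, by naming the join columns in the on argument. The following is a minimal sketch, assuming a data.table version that supports on (1.9.6 or later); the tables a and b and their values are made up for illustration.

```r
library(data.table)

# Small illustrative tables (hypothetical values, not the asker's data)
a <- data.table(id = c(1, 1, 2, 3),
                Year = c(2000, 2001, 2000, 2002),
                v1 = c(0.1, 0.2, 0.3, 0.4))
b <- data.table(id = c(1, 2, 4),
                Year = c(2001, 2000, 2003),
                v2 = c(10, 20, 30))

# Inner join on both columns; no keys are set beforehand
joined <- a[b, on = c("id", "Year"), nomatch = 0]
print(joined)
```

This leaves the original row order of both tables untouched, which can be convenient when sorting by key is not wanted.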
The merge takes about 0.02 seconds on this data, and the approach stays very fast for much larger data sets as well.
system.time(dt1[dt2,nomatch=0]) # 0.02 sec
system.time(dt2[dt1,nomatch=0]) # 0.02 sec
To learn more about data.table:
?data.table # Open the help page
example(data.table) # Run the examples from the help page
Hope this helps!
If not, please post more details.
Upvotes: 3