Reputation: 735
I have two dataframes:
id dates
MUM-1 2015-07-10
MUM-1 2015-07-11
MUM-1 2015-07-12
MUM-2 2014-01-14
MUM-2 2014-01-15
MUM-2 2014-01-16
MUM-2 2014-01-17
and:
id dates field1 field2
MUM-1 2015-07-10 1 0
MUM-1 2015-07-12 2 1
MUM-2 2014-01-14 4 3
MUM-2 2014-01-17 0 1
merged data:
id dates field1 field2
MUM-1 2015-07-10 1 0
MUM-1 2015-07-11 NA NA
MUM-1 2015-07-12 2 1
MUM-2 2014-01-14 4 3
MUM-2 2014-01-15 NA NA
MUM-2 2014-01-16 NA NA
MUM-2 2014-01-17 0 1
code: merge(x = df1, y = df2, by = 'id', all.x = TRUE)
I am using merge, but since both dataframes are huge it is taking too long to process. Is there a faster alternative to the merge function, maybe in dplyr? Both dataframes have more than 900K rows.
Upvotes: 7
Views: 58670
Reputation: 6496
You can convert both data frames to data tables, and then perform a merge:
library(data.table)
setDT(df1); setDT(df2)
merge(df1, df2, by = "id", allow.cartesian = TRUE)
The allow.cartesian = TRUE part allows the merge when there are repeated values in the key of either of the merged tables, i.e. it allows the resulting table to be longer than the larger of the original tables (see ?data.table).
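If the join is going to be run repeatedly, setting the key once up front should help as well. A minimal sketch of that variant (same df1 and df2 as above; when by is omitted, merge.data.table falls back to the shared key columns):
# set the join column as the key once; later merges use it directly
setkey(df1, id)
setkey(df2, id)
merge(df1, df2, allow.cartesian = TRUE)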
Upvotes: 3
Reputation: 83275
Instead of using merge with data.table, you can also simply join as follows:
setDT(df1)
setDT(df2)
df2[df1, on = c('id','dates')]
this gives:
> df2[df1, on = c('id','dates')]
id dates field1 field2
1: MUM-1 2015-07-10 1 0
2: MUM-1 2015-07-11 NA NA
3: MUM-1 2015-07-12 2 1
4: MUM-2 2014-01-14 4 3
5: MUM-2 2014-01-15 NA NA
6: MUM-2 2014-01-16 NA NA
7: MUM-2 2014-01-17 0 1
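For completeness: if you would rather drop the dates without a match instead of filling them with NA (an inner join), data.table's nomatch argument handles that. A small sketch with the same tables:
# keep only the dates that are present in both tables (inner join)
df2[df1, on = c('id','dates'), nomatch = 0L]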
Doing this with dplyr:
library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))
As mentioned by @Arun in the comments, a benchmark is not very meaningful on a small dataset with seven rows. So let's create some bigger datasets:
dt1 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")))
dt2 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")),
field1=sample(c(0,1,2,3,4), size=730, replace = TRUE),
field2=sample(c(0,1,2,3,4), size=730, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)]
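A quick sanity check of what that produces (dt1 covers every date for both ids, dt2 a random subset carrying the measured fields):
nrow(dt1) # 1460 rows: all dates for both ids
nrow(dt2) # 800 rows: random subset that has field1/field2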
As can be seen, @Arun's approach is slightly faster:
library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
jaap = dt2[dt1, on = c('id','dates')],
pavo = merge(dt1,dt2,by="id",allow.cartesian=T),
dplr = left_join(dt1, dt2, by=c("id","dates")),
arun = dt1[dt2, c("field1", "field2") := .(field1, field2), on=c("id", "dates")])
test elapsed relative
4 arun 0.015 1.000
1 jaap 0.016 1.067
3 dplr 0.037 2.467
2 pavo 1.033 68.867
For a comparison on a large dataset, see the answer of @Arun.
Upvotes: 18
Reputation: 118879
I'd update df1 directly by reference as follows:
require(data.table) # v1.9.5+
setDT(df1)[df2, c("field1", "field2") :=
           .(field1, field2), on=c("id", "dates")]
> df1
# id dates field1 field2
# 1: MUM-1 2015-07-10 1 0
# 2: MUM-1 2015-07-11 NA NA
# 3: MUM-1 2015-07-12 2 1
# 4: MUM-2 2014-01-14 4 3
# 5: MUM-2 2014-01-15 NA NA
# 6: MUM-2 2014-01-16 NA NA
# 7: MUM-2 2014-01-17 0 1
This would be very memory efficient (and faster), as it doesn't copy the entire object just to add two columns, but instead updates it in place.
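A quick way to convince yourself that no copy is made is to compare the object's memory address before and after the update; a minimal sketch using data.table's address() helper:
before <- address(df1) # memory address of df1 before the join
df1[df2, c("field1", "field2") := .(field1, field2), on = c("id", "dates")]
identical(before, address(df1)) # TRUE: df1 was updated in place, not copied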
Updated with a slightly bigger dataset than the one in @Jaap's updated benchmark:
set.seed(1L)
dt1 = CJ(id1 = paste("MUM", 1:1e4, sep = "-"), id2 = sample(1e3L))
dt2 = dt1[sample(nrow(dt1), 1e5L)][, c("field1", "field2") := lapply(c(1e3L, 1e4L), sample, 1e5L, TRUE)][]
# @Jaap's answers
system.time(ans1 <- setDT(dt2)[dt1, on = c('id1','id2')])
# user system elapsed
# 0.209 0.067 0.277
system.time(ans2 <- left_join(setDF(dt1), setDF(dt2), by = c("id1", "id2")))
# user system elapsed
# 119.911 0.530 120.749
# this answer
system.time(ans3 <- setDT(dt1)[dt2, c("field1", "field2") := list(field1, field2), on = c("id1", "id2")])
# user system elapsed
# 0.087 0.013 0.100
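As a sanity check (assuming the objects created above), the first two results should agree, since both follow dt1's row order and end up with the same columns:
# should be TRUE: same rows and columns, only attributes differ
all.equal(as.data.frame(ans1), as.data.frame(ans2), check.attributes = FALSE)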
sessionInfo()
# R version 3.2.1 (2015-06-18)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# Running under: OS X 10.10.4 (Yosemite)
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] data.table_1.9.5 dplyr_0.4.2
# loaded via a namespace (and not attached):
# [1] magrittr_1.5 R6_2.1.0 assertthat_0.1 parallel_3.2.1 DBI_0.3.1
# [6] tools_3.2.1 Rcpp_0.12.0 chron_2.3-45
I didn't expect dplyr to be ~1200x slower, though.
Upvotes: 10
Reputation: 16121
I think the quickest solution at the moment for these cases (huge datasets) is a data.table merge after setting the keys first.
You can also use dplyr's left_join with data.frames, as mentioned before, but it would be good to compare the same command after transforming your data.frames to data.tables. In other words, use dplyr with a data.table structure in the background.
As an example, I'll create two datasets and save each as a data.frame, a data.table with a key, and a data.table without a key. Then I'll perform the various merges and time each one:
library(data.table)
library(dplyr)
# create the first dataset: a keyed data.table, an unkeyed data.table, and a data.frame
list = seq(1, 500000) # note: 'list' shadows the base function of the same name
random_number = rnorm(500000, 10, 5)
dataT11 = data.table(list, random_number, key="list") # data.table with a key
dataT12 = data.table(list, random_number) # data.table without key
dataF1 = data.frame(list, random_number)
# create the second dataset in the same three forms
list = seq(1,500000)
random_number = rnorm(500000,10,5)
dataT21 = data.table(list, random_number, key="list")
dataT22 = data.table(list, random_number)
dataF2 = data.frame(list, random_number)
# check your current data tables (note some have keys)
tables()
# merge the datasets as data.frames and count time
ptm <- proc.time()
dataF3 = merge(dataF1, dataF2, all.x=T)
proc.time() - ptm
# merge the datasets as data.tables by setting the key now and count time
ptm <- proc.time()
dataT3 = merge(dataT12, dataT22, all.x=T, by = "list")
proc.time() - ptm
# merge the datasets as data.tables on the key they have already and count time
ptm <- proc.time()
dataT3 = merge(dataT11, dataT21, all.x=T)
proc.time() - ptm
# merge the datasets as data.tables on the key they have already and count time (alternative)
ptm <- proc.time()
dataT3 = dataT11[dataT21]
proc.time() - ptm
# merge the datasets as data.frames using dplyr and count time
ptm <- proc.time()
dataT3 = dataF1 %>% left_join(dataF2, by="list")
proc.time() - ptm
# merge the datasets as data.tables using dplyr and count time
ptm <- proc.time()
dataT3 = dataT11 %>% left_join(dataT21, by="list")
proc.time() - ptm
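As a side note, each ptm <- proc.time(); ...; proc.time() - ptm pair can be replaced by a single system.time() call; for example, the keyed data.table join above can be timed as:
# same measurement, more compact
system.time(dataT3 <- dataT11[dataT21])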
Upvotes: 1