Reputation: 2770
I am trying to understand whether I should use data.table or base R to merge data.tables. The two methods produce the same number of rows and columns and the same variable classes, but the identical function returns FALSE. I am trying to understand what differs between the two results.
library(data.table)

a <- data.frame(
  id   = 1:10000000,
  var1 = sample(letters, 10000000, replace = TRUE),
  var2 = sample(letters, 10000000, replace = TRUE),
  var3 = sample(letters, 10000000, replace = TRUE)
)
b <- data.frame(
  id   = 1:10000000,
  var4 = sample(letters, 10000000, replace = TRUE),
  var5 = sample(letters, 10000000, replace = TRUE),
  var6 = sample(letters, 10000000, replace = TRUE)
)
a <- data.table(a)
b <- data.table(b)

# data.table join syntax
system.time(dts <- a[b, on = .(id)])
# merge()
system.time(base <- merge(a, b, by = c("id")))

# returns FALSE
identical(dts, base)

# BUT the classes and dims are the same
sapply(dts, class)
sapply(base, class)
dim(base)
dim(dts)
Upvotes: 1
Views: 57
Reputation: 173517
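The base object has an additional attribute called sorted. This attribute is created by the default behavior of merge: with sort = TRUE (the default), the result is sorted on the by column and that column is set as its key. If you instead do:
base <- merge( a , b, by = c("id"), sort = FALSE)
they are identical.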
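To see the difference directly, you can inspect the key of each result. The following is a minimal sketch rather than a definitive check: it assumes much smaller tables than in the question (n = 1000 and one letter column per table, both chosen only to keep the example quick) and a seed that is there purely for reproducibility.

```r
library(data.table)

set.seed(1)   # assumption: seed only so the example is reproducible
n <- 1000L    # assumption: far smaller than the 10 million rows in the question

a <- data.table(id = 1:n, var1 = sample(letters, n, replace = TRUE))
b <- data.table(id = 1:n, var4 = sample(letters, n, replace = TRUE))

dts  <- a[b, on = .(id)]          # join syntax from the question
base <- merge(a, b, by = "id")    # merge() with its default sort = TRUE

# key() reads the "sorted" attribute of a data.table
key(dts)
key(base)
attr(base, "sorted")

# Compare the contents while ignoring attributes such as the key
all.equal(dts, base, check.attributes = FALSE)
```

If the key is the only difference, key(dts) should come back NULL while key(base) names the id column, and the all.equal call should report that the data itself matches.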
Upvotes: 3