MatthewR

Reputation: 2770

Merging/Joining with data.table not identical to merge function

I am trying to decide whether I should use data.table's join syntax or base R's merge to join two data.tables. The two methods produce results with the same number of rows and columns and the same column classes, but identical() returns FALSE when I compare them. What is different between the two results?

library( data.table )

a <- data.frame(
    id = 1:10000000,
    var1 = sample(letters, 10000000, replace = TRUE),
    var2 = sample(letters, 10000000, replace = TRUE),
    var3 = sample(letters, 10000000, replace = TRUE)
)

b <- data.frame(
    id = 1:10000000,
    var4 = sample(letters, 10000000, replace = TRUE),
    var5 = sample(letters, 10000000, replace = TRUE),
    var6 = sample(letters, 10000000, replace = TRUE)
)


a <- data.table( a )
b <- data.table( b )

system.time( dts <- a[b, on = .(id )] )
system.time( base <- merge( a , b, by = c("id") ) )

# returns FALSE
    identical( dts , base )

# BUT the classes and dims are the same
    sapply( dts , class  )
    sapply( base , class  )

    dim( base )
    dim( dts )

Upvotes: 1

Views: 57

Answers (1)

joran

Reputation: 173517

The base version has an additional attribute called sorted. This attribute is created by the default behavior of merge (sort = TRUE), which sorts the result on the by column and records that as a key. If you instead do:

base <- merge( a , b, by = c("id"),sort = FALSE)

they are identical.
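
To see the difference for yourself, something like the sketch below may help. It assumes a much smaller table than in the question (the size `n` and the reduced column set are only for illustration). Note that because `a` and `b` are data.tables, `merge()` dispatches to data.table's own merge method, and `key()` reports the columns stored in the sorted attribute.

```r
library(data.table)

set.seed(1)
n <- 1000L  # hypothetical small size, chosen only to keep the example fast

a <- data.table(id = 1:n, var1 = sample(letters, n, replace = TRUE))
b <- data.table(id = 1:n, var4 = sample(letters, n, replace = TRUE))

dts  <- a[b, on = .(id)]          # join syntax: result is not keyed
base <- merge(a, b, by = "id")    # sort = TRUE by default: result is keyed

key(dts)    # NULL
key(base)   # "id"

# Attribute names present on the merge result but not on the join result
setdiff(names(attributes(base)), names(attributes(dts)))

# all.equal() for data.tables can skip the attribute check,
# which compares the data without regard to the key
all.equal(dts, base, check.attributes = FALSE)

# Skipping the sort avoids setting the key in the first place
identical(dts, merge(a, b, by = "id", sort = FALSE))
```

So the data are the same either way; which form to use probably depends on whether you want the result keyed on id.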

Upvotes: 3
