btsai-dev

Reputation: 173

hash - Identical R Dataframes, different hashes (not an attribute problem)

I have two dataframes, X and Y, of ~150 rows each, where identical(X, Y) is TRUE but identical(digest(X), digest(Y)) is FALSE. I'm trying to work out why.

I did look at this answer and re-ran what they tested, with similar results, but unlike their problem, the attributes for my dataframes are the same. Testing results:

> names(attributes(X))
[1] "names"     "row.names" "class"
> names(attributes(Y))
[1] "names"     "row.names" "class"  

> digest(X)
[1] "07b7ef11ce6eaae01ddd79e4facef581"
> digest(Y)
[1] "09d8abcab0af0a72265a9b690f4eacc3"

> digest(X[1:nrow(X),])
[1] "2f338de9972529bd2bc9c39c3c5762ea"
> digest(Y[1:nrow(Y),])
[1] "2f338de9972529bd2bc9c39c3c5762ea"

> identical(X, Y, attrib.as.set=FALSE)
[1] TRUE

I also saved the dataframes as .RDS files, and re-read them in.

> X_rds <- read_rds("cache_vars/X.rds")
> Y_rds <- read_rds("cache_vars/Y.rds")
> identical(X_rds, Y_rds)
[1] TRUE
> digest(X_rds)
[1] "07b7ef11ce6eaae01ddd79e4facef581"
> digest(Y_rds)
[1] "09d8abcab0af0a72265a9b690f4eacc3"
> identical(X_rds, Y_rds, attrib.as.set=FALSE)
[1] TRUE

And like the other poster, converting to matrices and back to dataframe yielded identical digests, so it's probably some structural problem.

> X_Mat <- as.matrix(X_rds)
> Y_Mat <- as.matrix(Y_rds)
> identical(digest(X_Mat), digest(Y_Mat))
[1] TRUE
> X_DF <- as.data.frame(X_Mat)
> Y_DF <- as.data.frame(Y_Mat)
> identical(digest(X_DF), digest(Y_DF))
[1] TRUE

Dataframe X was produced by a loop written for parallel execution (but using the %do% operator, so nothing actually ran in parallel), and Y was produced by a plain sequential loop.

The .RDS files for X and Y can be found at this link.

Update: MrFlick has it right. As it turns out, the serialization during parallel's rbind function was also adding the gp=0x20 flag, similar to what they describe happening when writing to RDS.

Upvotes: 2

Views: 63

Answers (1)

MrFlick

Reputation: 206197

When you write to rds, the objects are serialized. The serialization contains some information beyond the values the vectors hold. Note that if we compare the columns one by one, they produce different digests:

sapply(seq_along(X_rds), function(i)
  digest::digest(X_rds[[i]])==digest::digest(Y_rds[[i]])
)

So the vectors that are being stored in the data.frame are different. We can use the internal inspect function to get some of the meta-data for the vectors

.Internal(inspect(X_rds[[1]]))
# @135305a00 14 REALSXP g0c7 [REF(4),gp=0x20] (len=150, tl=0) 
# 1.009e+06,1.009e+06,1.009e+06,1.009e+06,1.009e+06,...
.Internal(inspect(Y_rds[[1]]))
# @115dbfc00 14 REALSXP g0c7 [REF(29)] (len=150, tl=0) 
# 1.009e+06,1.009e+06,1.009e+06,1.009e+06,1.009e+06,...

So we see they differ in the [] parts. I believe the REF() number is the reference count for that object, used for memory management, and I do not believe it is written out during serialization. But X_rds also has gp=0x20 set. The "gp" stands for "general purpose" bits/flags. I believe in this case it means the GROWABLE_MASK was set on that object. These flags are preserved when the object is serialized, and serializing the object first is the default behavior for digest. Thus these vectors do not have exactly the same serialization, because of this flag difference.

Another way to see the difference is to look at the serialization itself:

substring(rawToChar(serialize(X_rds[[1]], connection = NULL, ascii = TRUE)), 1, 45)
# [1] "A\n3\n262657\n197888\n5\nUTF-8\n131086\n150\n1009002\n"
substring(rawToChar(serialize(Y_rds[[1]], connection = NULL, ascii = TRUE)), 1, 45)
# [1] "A\n3\n262657\n197888\n5\nUTF-8\n14\n150\n1009002\n1009"

We have a bit of a header, then we start to see the values being output. There is one value where there is a difference: X has 131086 (0x2000E) and Y has 14 (0xE). That difference comes from the flags, which are written here in the R source code.
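To make the arithmetic concrete, here is a minimal sketch that splits the two flag words from the output above into their parts. It assumes, from my reading of PackFlags in R's serialize.c, that the low 8 bits hold the SEXP type and that the gp field starts at bit 12. decode_flags is a hypothetical helper name, not something R provides.

```r
# Flag words copied from the ASCII serialization output above.
flags_x <- 131086L
flags_y <- 14L

# Hypothetical helper. Assumed layout (from PackFlags in serialize.c):
#   bits 0-7  : SEXP type code (14 is REALSXP)
#   bits 12+  : the gp ("general purpose") field
decode_flags <- function(flags) {
  c(type = bitwAnd(flags, 255L),     # keep only the low 8 bits
    gp   = bitwShiftR(flags, 12L))   # drop the low 12 bits
}

decode_flags(flags_x)  # type 14, gp 32 (that is, 0x20)
decode_flags(flags_y)  # type 14, gp 0
```

Under that assumption both words describe the same vector type, and the only difference is the gp value 0x20 that inspect reported for X_rds.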

When you use identical, only the values in the data.frame are compared, not the additional metadata.
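The two notions of equality can be checked side by side. The following is only a sketch using base R: compare_columns is a hypothetical helper that reports, for each column, whether identical() agrees and whether the serialized bytes agree. The toy data frame is made up for illustration.

```r
# Hypothetical helper: per-column comparison of values vs. serialized bytes.
compare_columns <- function(a, b) {
  stopifnot(length(a) == length(b))
  idx <- seq_along(a)
  data.frame(
    column = names(a),
    # identical() looks at the values (and attributes) of each column
    same_values = vapply(idx, function(i) identical(a[[i]], b[[i]]),
                         logical(1)),
    # serialize() also writes the flag word, so this can be FALSE
    # even when same_values is TRUE
    same_bytes = vapply(idx, function(i) {
      identical(serialize(a[[i]], NULL), serialize(b[[i]], NULL))
    }, logical(1))
  )
}

# Toy data, invented for this example.
df <- data.frame(id = c(1, 2, 3), label = c("a", "b", "c"))
compare_columns(df, df)

# With the objects from the question:
# compare_columns(X_rds, Y_rds)
```

A column showing same_values TRUE and same_bytes FALSE would be one where only the serialized metadata differs, which is the situation described for X_rds and Y_rds.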

If you wanted to get around this, you could write your own wrapper around digest that avoids the serialization. For example

dfdigest <- function(x) {
  charsToRaw <- function(x) unlist(lapply(x, charToRaw))
  bytes <- unlist(c(list(charsToRaw(names(x))), 
                    lapply(x, function(col) {
    if (typeof(col)=="double") writeBin(col, raw())
    else if (typeof(col)=="character") charsToRaw(col)
    else stop(paste("unconfigured data type:", typeof(col)))
  })))
  digest::digest(bytes, serialize = FALSE)
}

dfdigest(X_rds)
# [1] "2488505e3ad1a370d030b539a287b7ca"
dfdigest(Y_rds)
# [1] "2488505e3ad1a370d030b539a287b7ca"

Upvotes: 2
