lost

Reputation: 1669

compare md5 of an object in memory to the md5 of it as an .Rds

I'd like a way to compare the md5sum of an object in memory to that of an .Rds file of the same object and get identical hashes.

My naive attempt, which does not produce identical hashes:

library(tools)
library(digest)

digest(c(1,2,3), algo = "md5")
#> [1] "af9e5c24af013c970922362b8850b060"

saveRDS(c(1,2,3), "123.Rds")

digest("123.Rds", algo = "md5", file = TRUE)
#> [1] "efb450974fefa662a54d1a2563a4f03b"
md5sum("123.Rds") # redundant
#>                            123.Rds 
#> "efb450974fefa662a54d1a2563a4f03b"

I'm aware that serialize() includes the R version in the serialization, so if saveRDS() is doing this too, which I suspect it is, then maybe what I am looking for is a saveRDS() that doesn't do that? I may work on projects across multiple R versions, and I'd like a solution here that will give identical results across (non-ancient) versions of R.

Upvotes: 0

Views: 435

Answers (1)

MrFlick

Reputation: 206253

The digest of the rds file will not match the digest of the in-memory object. saveRDS() compresses the serialized data by default and adds metadata such as the file format version, so the bytes on disk differ from the bytes digest() hashes for the object, and the digests differ too.
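To illustrate that the mismatch comes from how the file is written rather than from the data, here is a minimal sketch. It assumes the digest package is installed; the temporary file and the variable names are only for illustration. Hashing the object read back with readRDS() should reproduce the in-memory digest, while hashing the file bytes should not.

```r
library(digest)

x <- c(1, 2, 3)
rdsfile <- tempfile(fileext = ".rds")  # illustrative temporary path
saveRDS(x, rdsfile)

# Digest of the object as it exists in memory
hash_memory <- digest(x, algo = "md5")

# Digest of the raw bytes of the .rds file on disk
hash_file <- digest(rdsfile, algo = "md5", file = TRUE)

# Digest of the object after reading it back into memory
hash_reloaded <- digest(readRDS(rdsfile), algo = "md5")

same_after_reload <- identical(hash_memory, hash_reloaded)
same_as_file <- identical(hash_memory, hash_file)
```

The cost of this approach is that the file has to be loaded before it can be compared, which is what caching the digest separately avoids.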

Instead, consider writing the digest of the data to a separate file at the same time you write the rds. Then, if you want to compare data later, you can just read the cached digest rather than having to reload the original data. For example:

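# Save x as an .rds file and write its md5 digest to a sidecar file
save_rds <- function(x, rdsname = paste0(deparse(substitute(x)), ".rds"), digestname = paste0(rdsname, ".digest")) {
  hash <- digest::digest(x, algo = "md5")  # hash the in-memory object
  saveRDS(x, rdsname)
  writeLines(hash, digestname)
  invisible(hash)
}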

You can call this function like

save_rds(iris)

and it will write out iris.rds and iris.rds.digest.

Then, if you want to compare data to a hash at some later point, you can have a helper function like

# Compare the md5 digest of x with the digest stored in digestfile
digest_match <- function(x, digestfile) {
  hash <- digest::digest(x, algo = "md5")
  orig_hash <- readLines(digestfile, n = 1)
  identical(hash, orig_hash)
}

And you can test it with

digest_match(iris, "iris.rds.digest")
# [1] TRUE
iris[1,1] <- 10
digest_match(iris, "iris.rds.digest")
# [1] FALSE
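If you also want to check a file when you load it, the same idea can be wrapped in a loader. This is only a sketch: read_rds_verified() is a hypothetical helper name, not part of digest or base R, and it assumes the digest file sits next to the rds with a ".digest" suffix as in save_rds() above.

```r
library(digest)

# Hypothetical helper: read an .rds file and check it against its digest file
read_rds_verified <- function(rdsname, digestname = paste0(rdsname, ".digest")) {
  x <- readRDS(rdsname)
  stored <- readLines(digestname, n = 1)
  current <- digest(x, algo = "md5")
  if (!identical(current, stored)) {
    stop("Digest mismatch for ", rdsname)
  }
  x
}

# Usage: write an object and its digest to a temporary directory, then reload it
rdsfile <- file.path(tempdir(), "iris.rds")
saveRDS(iris, rdsfile)
writeLines(digest(iris, algo = "md5"), paste0(rdsfile, ".digest"))
reloaded <- read_rds_verified(rdsfile)
```

Unlike digest_match(), this does reload the data, so it is better suited to verifying a file on read than to quick comparisons.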

Upvotes: 1
