lokheart
lokheart

Reputation: 24665

using hash to determine whether 2 dataframes are identical (PART 01)

I have created a dataset using WHO ATC/DDD Index a few months before and I want to make sure if the database online remains unchanged today, so I downloaded it again and try to use the digest package in R to do the comparison.

The two dataset (in txt format) can be downloaded here. (I am aware that you may think the files are unsafe and may have virus, but I don't know how to generate a dummy dataset to replicate the issue I have now, so I upload the dataset finally)

And I have written a little script as below:

library(digest)

ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)


ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])

ddd.old <- data.frame(ddd.old, hash = apply(ddd.old, 1, digest),stringsAsFactors=FALSE)
ddd.new <- data.frame(ddd.new, hash = apply(ddd.new, 1, digest),stringsAsFactors=FALSE)

ddd.old <- ddd.old[order(ddd.old[,"hash"]),]
ddd.new <- ddd.new[order(ddd.new[,"hash"]),]

And something really interesting happens when I do the checking:

> table(ddd.old[,"hash"]%in%ddd.new[,"hash"]) #line01

TRUE 
 506 
> table(ddd.new[,"hash"]%in%ddd.old[,"hash"]) #line02

TRUE 
 506 
> digest(ddd.old[,"hash"])==digest(ddd.new[,"hash"]) #line03
[1] TRUE
> digest(ddd.old)==digest(ddd.new) #line04
[1] FALSE

What happen? Both dataframe with the identical rows (from line01 and line02), same order (from line03), but are different? (from line04)

Or do I have any misunderstanding about digest? Thanks.

Upvotes: 5

Views: 268

Answers (1)

Richie Cotton
Richie Cotton

Reputation: 121057

Read in data as before.

ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])

Like Marek said, start by checking for differences with all.equal.

all.equal(ddd.old, ddd.new)
[1] "Component 6: 4 string mismatches" 
[2] "Component 8: 24 string mismatches"

So we just need to look at columns 6 and 8.

different.old <- ddd.old[, c(6, 8)]   
different.new <- ddd.new[, c(6, 8)]

Hash these columns.

hash.old <- apply(different.old, 1, digest)
hash.new <- apply(different.new, 1, digest)

And find the rows where they don't match.

different_rows <- which(hash.old != hash.new)  #which is optional

Finally, combine the datasets.

cbind(different.old[different_rows, ], different.new[different_rows, ])

Upvotes: 4

Related Questions