Reputation: 99331
For packages that rely on research data, are there facilities in R that run a periodic check to determine whether an update has been made, i.e. whether the website and the package now differ?
Here's a reproducible example. The following function returns the set of years for which data exist on the webpage. The data themselves are downloaded from the site by a different function; check
is meant to validate the year argument passed to that other function.
check <- function ()
{
    ## collect the "eve.zip" links on the events page and strip the
    ## suffix to recover the available years
    doc <- htmlTreeParse("http://www.retrosheet.org/events",
                         useInternalNodes = TRUE)
    on.exit(free(doc))
    xv <- sapply(doc["//a"], xmlValue)
    gg <- xv[grepl("eve.zip", xv, fixed = TRUE)]
    res <- gsub("(s?)eve.zip", "", gg)
    as.numeric(unique(res))
}
A subset of the result is
> library(XML)
> check()[1:5]
# [1] 1920 1921 1922 1927 1930
Notice that this is not a contiguous sequence. Data may be added to the site later, and new years may then appear in the result. If I store the available years as a package object, I will not know when the data on the site have been updated.
Timing the check gives
> system.time({ check() })
# user system elapsed
# 0.060 0.003 0.875
which is not slow, but the function that uses this check could be made more efficient without it, since it goes to the same site to download the data after check
confirms that the data exist.
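For context, here is a rough sketch of how that downloading function might use check; the name get_event_files, its error handling, and the exact URL pattern are my assumptions, not the real code.
# Hypothetical downloader (sketch): check() validates the requested year
# against what the site currently lists, so newly added years are accepted
# without updating a stored package object.
get_event_files <- function (year)
{
    available <- check()
    if (!year %in% available)
        stop("no event files for ", year, "; available years are ",
             paste(available, collapse = ", "))
    url <- paste0("http://www.retrosheet.org/events/", year, "eve.zip")
    destfile <- file.path(tempdir(), basename(url))
    download.file(url, destfile)
    destfile
}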
Upvotes: 1
Views: 57
Reputation: 368201
Many moons ago, I actually wrote the digest package for a very similar task.
And as it turns out, many packages have similar needs for comparing data sets, which is how the digest package ended up being used by many of the data-caching packages, by knitr to see whether chunks have changed, and so on.
So if you have your data in R, consider comparing digest checksums. That said, this still requires access to both the before and after data sets to determine whether they have changed -- and there may be other ways. But digest provides a relatively widely used comparison method based on checksums of the complete object serialization.
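A minimal sketch of that idea, assuming the result of check() is what the package caches (the file name and workflow here are mine, not part of digest):
library(digest)

# Hash the current result of check() and compare it with a cached hash from
# the previous run; a mismatch signals that the site has been updated.
hash_file <- "retrosheet_years.md5.rds"    # hypothetical cache location
new_hash  <- digest(check(), algo = "md5")

old_hash <- if (file.exists(hash_file)) readRDS(hash_file) else NA_character_
if (!identical(old_hash, new_hash)) {
    message("Upstream data appear to have changed; refresh the package data.")
    saveRDS(new_hash, hash_file)
}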
Upvotes: 3
Reputation: 44525
You could take a look at my UNF package, which implements the Universal Numeric Fingerprint algorithm for datasets. It's primarily designed for data.frames but works on vectors as well:
> library("UNF")
> unf(z)
Universal Numeric Fingerprint (Truncated): UNF:5:Iw2Mw/fiLQ+OzNrOtolwFw==
As in @DirkEddelbuettel's answer, this is a hash of the original data (it relies on library("digest")
under the hood). The algorithm is described on GitHub.
In short, if the UNF signature changes, the data have changed. If you're working with data.frames, the package also includes functions for comparing data.frames and finding which variables/columns differ.
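For example, a rough sketch of a change check that compares the fingerprint objects directly with identical() (old and new stand for the shipped and freshly downloaded copies and are made-up data here):
library("UNF")

# old: the data.frame currently shipped with the package (made-up example)
# new: a freshly downloaded copy of the same data
old <- data.frame(year = c(1920, 1921, 1922))
new <- data.frame(year = c(1920, 1921, 1922, 1927))

if (!identical(unf(old), unf(new))) {
    message("UNF signatures differ: the upstream data have changed.")
}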
Upvotes: 2