Rich Scriven

Reputation: 99331

Are there facilities to automate updates in package data?

For packages that rely on research data, are there facilities in R that run a periodic check to determine whether the data has been updated, or whether the website and the package now differ?

Here's a reproducible example. The following function returns the set of years for which data exists on the webpage. The data itself can be downloaded from the site, which a different function handles. check is meant as a way to handle errors in argument matching in that other function.

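library(XML)

# Return the years for which event files are listed on the Retrosheet page.
check <- function () 
{
    doc <- htmlTreeParse("http://www.retrosheet.org/events",
                         useInternalNodes = TRUE)
    on.exit(free(doc))
    xv <- sapply(doc["//a"], xmlValue)             # text of every link on the page
    gg <- xv[grepl("eve.zip", xv, fixed = TRUE)]   # keep only the event-file links
    res <- gsub("(s?)eve.zip", "", gg)             # strip the file-name suffix
    as.numeric(unique(res))
}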

A subset of the result is

> library(XML)
> check()[1:5]
# [1] 1920 1921 1922 1927 1930

Notice that this is not a sequence. Data may be added to the site later, and new years may appear in the result. If I store the available years as a package object, I will not know that an update has been made to the data.

A profile of speed is

> system.time({ check() })
#   user  system elapsed 
#  0.060   0.003   0.875

which is not slow, but the function that uses this check could be made more efficient without it, since it also goes to the same site to download the data after check makes sure it exists.
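
To make the stored-years concern concrete, here is a minimal sketch of the comparison it implies. It assumes the years known at build time are kept in a vector; stored_years and new_years are hypothetical names, not part of any package, and a literal vector stands in for the result of check() so the example runs offline.

```r
# Hypothetical: the years recorded when the package was built.
stored_years <- c(1920, 1921, 1922, 1927, 1930)

# Hypothetical helper: years listed on the site but absent from the stored set.
new_years <- function(current, stored) {
  sort(setdiff(current, stored))
}

# `current` would normally come from check(); a literal vector is used here.
new_years(c(1920, 1921, 1922, 1927, 1930, 1931), stored_years)
# [1] 1931
```

An empty result would mean the stored set is still current. This still requires one request to the site, so it does not remove the cost shown in the timing above.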

Upvotes: 1

Views: 57

Answers (2)

Dirk is no longer here

Reputation: 368201

Many moons ago, I actually wrote the digest package for a very similar task.

And as it turns out, many packages have similar needs when it comes to comparing data sets, which is how the digest package ended up being used by many of the data caching packages, or by knitr to see if chunks changed, and so on.

So if you have your data in R, consider comparing a digest checksum. That said, this still requires accessing the before and after data sets to determine if they have changed or not -- and there may be other ways. But digest provides a relatively widely used comparison method based on checksums of complete object serialization.
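
As a rough sketch of that comparison, assuming a checksum of the earlier data was saved at some point: stored_years, stored_hash and data_changed are hypothetical names used only for illustration, and sha256 is an arbitrary choice among the algorithms digest() supports.

```r
library(digest)

# Hypothetical: the data as it looked when the checksum was recorded.
stored_years <- c(1920, 1921, 1922, 1927, 1930)
stored_hash  <- digest(stored_years, algo = "sha256")

# Hypothetical helper: TRUE when the current object no longer matches the hash.
data_changed <- function(current, reference_hash) {
  !identical(digest(current, algo = "sha256"), reference_hash)
}

data_changed(stored_years, stored_hash)            # FALSE: same object, same checksum
data_changed(c(stored_years, 1931), stored_hash)   # TRUE: an extra year changes it
```

Only the checksum has to be stored with the package, but the current data still has to be fetched in order to hash it.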

Upvotes: 3

Thomas

Reputation: 44525

You could take a look at my UNF package, which implements the Universal Numeric Fingerprint algorithm for datasets. It's primarily designed for data.frames but works on vectors as well:

> library("UNF")
> unf(z)
Universal Numeric Fingerprint (Truncated): UNF:5:Iw2Mw/fiLQ+OzNrOtolwFw==

Just like @DirkEddelbuettel's answer, this is a hash of the original data (and relies on library("digest") under the hood). The algorithm is described on GitHub.

In short, if the UNF signature changes, the data has changed. If you're working with data.frames, there are also some functions in the package for comparing them and finding which variables/columns differ.

Upvotes: 2
