larenite

Reputation: 287

How can I reduce installation time for a data R package I am building?

Alternative wording of Q: What factors determine walltime for the lazyload DB step of R package installation?

I am developing an R package to make it easy for users to access consortium data. The data directory is 4.5 GB; all objects are compressed with bzip2. There are 202 individual .RData files, ranging from 133 B to 24 MB compressed.
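For reference, the files are written along these lines (a sketch; the object and file names are placeholders):

save(consortium_object, file = "data/consortium_object.RData",
     compress = "bzip2")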

When I install the package, the output looks something like this:

Downloading GitHub repo mypackage@HEAD
✓  checking for file ‘.../mypackage-1c6478a/DESCRIPTION’ ...
─  preparing ‘mypackage’:
✓  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘mypackage_1.0.0.tar.gz’ (1.3s)
   
* installing *source* package ‘mypackage’ ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (mypackage)

The *** moving datasets to lazyload DB step takes the longest, about 5 minutes. What dictates walltime for this step? Number of objects? Size of objects? File compression? Is there anything I can do to make it install faster?

EDIT: I do want all of the R objects to be lazy-loaded, and I want them all to have accompanying documentation, so I believe the best practice is to keep the .rda files in data/. I am specifically wondering if there is a way to speed up the lazy-loading step when the package is being installed.
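For context, the DESCRIPTION fields driving this step look something like the following; a minimal sketch, and the LazyDataCompression line is an assumption about my setup rather than a given:

Package: mypackage
Version: 1.0.0
LazyData: true
LazyDataCompression: bzip2

(LazyData: true is what enables the lazyload DB step, and LazyDataCompression controls how that database is compressed at install time.)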

Upvotes: 5

Views: 527

Answers (1)

Anders Ellern Bilgrau

Reputation: 10223

I suppose you're adding your data to the data/ folder. An alternative would be to put it under inst/, for example inst/extdata/, and then make it available for loading through functions in your package, using the path returned by system.file("extdata/mydataset.Rds", package = "bar"). You'll then need to call that helper function to get your data.

I.e. something like this for a specific dataset:

loadPackageData <- function() {
  # Read a single .Rds file shipped under inst/extdata/ of the installed package
  readRDS(system.file("extdata/foo.Rds", package = "bar"))
}

Edit: To load multiple datasets you could do:

bar_data_files <- list.files(system.file("extdata", package = "bar"),
                             pattern = "\\.Rds$", full.names = TRUE)
barData <- setNames(lapply(bar_data_files, readRDS),
                    tools::file_path_sans_ext(basename(bar_data_files)))

# Then to get the foo dataset:
barData$foo

# Or view the dataset names:
names(barData)

Auto completion would also work here.

A more conventional approach could be:

loadBarData <- function(dataset) {
  bar_data_files <- list.files(system.file("extdata", package = "bar"),
                               pattern = "\\.Rds$", full.names = TRUE)
  files_sans_ext <- tools::file_path_sans_ext(basename(bar_data_files))
  if (missing(dataset)) {
    # Called without an argument: list the available datasets
    print(files_sans_ext)
  } else if (!(dataset %in% files_sans_ext)) {
    stop("Could not match dataset: ", dataset)
  } else {
    readRDS(bar_data_files[match(dataset, files_sans_ext)])
  }
}

loadBarData()      # List all available datasets
loadBarData("foo") # Loads "foo" if it is found

You can of course expand on this and define what should happen when you ask for multiple datasets in a vector, e.g. returning them in a named list, or combining the datasets into one if they have the same structure, as sketched below.
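A minimal sketch of the vectorized case, reusing the hypothetical bar package and extdata layout from above:

loadBarDatasets <- function(datasets) {
  bar_data_files <- list.files(system.file("extdata", package = "bar"),
                               pattern = "\\.Rds$", full.names = TRUE)
  files_sans_ext <- tools::file_path_sans_ext(basename(bar_data_files))
  missing_sets <- setdiff(datasets, files_sans_ext)
  if (length(missing_sets) > 0) {
    stop("Could not match dataset(s): ", paste(missing_sets, collapse = ", "))
  }
  # Return a named list, one element per requested dataset
  setNames(lapply(datasets, function(d) {
    readRDS(bar_data_files[match(d, files_sans_ext)])
  }), datasets)
}

loadBarDatasets(c("foo", "baz")) # Named list with elements foo and baz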

One can also imagine alternatives if the datasets are systematically named.
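For instance, if every dataset follows an assumed inst/extdata/<name>.Rds convention, the listing step can be dropped and the path constructed directly:

getBarData <- function(name) {
  # Assumed layout: one file per dataset at inst/extdata/<name>.Rds
  path <- system.file("extdata", paste0(name, ".Rds"), package = "bar")
  if (path == "") stop("No dataset named: ", name)
  readRDS(path)
}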

In any case, the idea is that the data is loaded on demand via a function call instead of going through the lazyload DB at install time.

Upvotes: 1
