Graeme Walsh

Reputation: 678

How to put datasets into an R package

I am creating my own R package and I was wondering what are the possible methods that I can use to add (time-series) datasets to my package. Here are the specifics:

I have created a package subdirectory called data and I am aware that this is the location where I should save the datasets that I want to add to my package. I am also cognizant of the fact that the files containing the data may be .rda, .txt, or .csv files.

Each series of data that I want to add to the package consists of a single column of numbers (e.g. of the form 340 or 4.5), and the series differ in length.

So far, I have saved all of the datasets into a .txt file. I have also successfully loaded the data using the data() function. Problem not solved, however.

The problem is that each series of data loads as a factor except for the series greatest in length. The series that load as factors contain missing values (of the form '.'). I had to add these missing values in order to make each column of data the same in length. I tried saving the data as unequal columns, but I received an error message after calling data().

A consequence of adding missing values to get the data to load is that once the data is loaded, I need to remove the NA's in order to get on with my analysis of the data! So, this clearly is not a good way of doing things.

Ideally (I suppose), I would like the data to load as numeric vectors or as a list. In this way, I wouldn't need the NA's appended to the end of each series.

How do I solve this problem? Should I save all of the data into one single file? If so, in what format should I do it? Perhaps I should save the datasets into a number of files? Again, in which format? What is the best practical way of doing this? Any tips would be greatly appreciated.

Upvotes: 16

Views: 6604

Answers (4)

stevec

Reputation: 52867

You'll need to create the data file and include it in the R package, and you may want to also document it. Here's how to do both.

Create the data file and include it in R package

  • Create a directory inside the package called data/ and place the data in it. Use only .rda and .RData files.
  • When creating the rda/RData file from an R object, make sure the R object is named what you want it to be named when it's used in the package and use save() to create it. Example:
save(river_fish, file = "data/river_fish.rda", version = 2)
  • Add this on a new line in the file called DESCRIPTION:
LazyData: true
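
As a rough sketch of the steps above applied to series of unequal length: the series can be kept in one named list and saved as a single .rda file, so no padding with missing values is needed. The scratch directory, the package name mypkg, and the series names and values are all hypothetical; in a real package, the data/ directory of the package itself would be used instead.

```r
# Hypothetical scratch directory standing in for the package root.
pkg_root <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg_root, "data"), recursive = TRUE, showWarnings = FALSE)

# Made-up series of unequal length, kept together in one named list.
river_fish <- list(
  trout  = c(340, 310, 295, 288),
  salmon = c(4.5, 4.7)
)

# The object name (river_fish) is the name users will see in the package.
rda_path <- file.path(pkg_root, "data", "river_fish.rda")
save(river_fish, file = rda_path, version = 2)

# Sanity check: load into a fresh environment and inspect the result.
check_env <- new.env()
loaded_names <- load(rda_path, envir = check_env)
str(check_env$river_fish)
```

Each element stays a plain numeric vector, so it can be passed to ts() or used directly.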

Documenting the dataset

Document the dataset in an .R file under R/ by placing a string with the dataset's name (e.g. "river_fish") directly after the roxygen comment block:

#' This is data to be included in my package
#'
#' @author My Name \email{blahblah@@roxygen.org}
#' @references \url{data_blah.com}
"data-name"

The dplyr source contains some nice examples of documented datasets.

Notes

  • To access the data in the package, run river_fish or whatever the name of the dataset is. Nothing more is needed.

  • Using version = 2 when calling save() keeps the data file readable by older R versions (prior to 3.5.0), which prevents this warning:

WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.

  • No need to use load() in the R package (just call the object directly instead e.g. river_fish will be enough to yield the data from data/river_fish.rda), but in the event you do wish to load an rda/RData file for some reason (e.g. playing around or testing), this will do it:
load("data/river_fish.rda")

Upvotes: 1

epo3

Reputation: 3121

The preferred location for saving your data depends on its format.

As Hadley suggested:

  • If you want to store binary data and make it available to the user, put it in data/. This is the best place to put example datasets.
  • If you want to store parsed data, but not make it available to the user, put it in R/sysdata.rda. This is the best place to put data that your functions need.
  • If you want to store raw data, put it in inst/extdata.

I suggest you have a look at the linked chapter (the data chapter of Hadley Wickham's R Packages book), as it goes into detail about working with data when developing R packages.
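
For the inst/extdata case, a minimal sketch of how a raw file is usually located once the package is installed. The package name mypkg and the file name river_fish.csv are hypothetical; system.file() returns an empty string when the package or file cannot be found, so the sketch checks for that before reading.

```r
# Files under inst/extdata end up in extdata/ of the installed package.
# "mypkg" and "river_fish.csv" are placeholder names for illustration.
path <- system.file("extdata", "river_fish.csv", package = "mypkg")

if (nzchar(path)) {
  raw <- read.csv(path)
  str(raw)
} else {
  message("File not found: is the package installed?")
}
```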

Upvotes: 1

IRTFM

Reputation: 263481

In addition to saving as rda files you could also choose to load them as numeric with:

 read.table( ... , colClasses="numeric")

Or as non-factor-text:

 read.table( ..., as.is=TRUE) # which does pretty much the same as stringsAsFactors=FALSE
 read.table( ..., colClasses="character")

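Note that data() itself will not accept these arguments: its ... takes dataset names, and it is documented to read .txt files with a fixed call to read.table(..., header=TRUE). So apply these options when reading the file yourself, then save the result as an .rda file.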
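
A minimal sketch of the read.table() options above, applied to the padded layout described in the question. The inline text, column names, and values are made up for illustration; na.strings = "." is an addition that tells read.table() to treat the '.' padding as NA, so that the columns can be parsed as numeric.

```r
# Made-up stand-in for the padded .txt file: the shorter series "a"
# is filled with "." to match the length of "b".
raw_text <- "a b
340 4.5
120 3.1
. 2.2
"

padded <- read.table(
  text       = raw_text,
  header     = TRUE,
  na.strings = ".",        # read the "." padding as NA
  colClasses = "numeric"   # force numeric columns instead of factors
)

# Drop the padding again to get a list of unequal-length numeric vectors.
series <- lapply(padded, function(x) x[!is.na(x)])
str(series)
```

The resulting list could then be stored with save() so the padding never reaches the package user.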

Upvotes: 4

user1265067

Reputation: 897

I'm not sure if I understood your question correctly. But, if you edit your data in your favorite format and save with

save(myediteddata, file="data.rda")

The data should be loaded exactly the way you saw it in R.
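
To illustrate the point about the data coming back unchanged, here is a small round-trip sketch with a time series. The values, the start date, and the temporary file location are assumptions made for the example.

```r
# Made-up quarterly series; class and time attributes are part of the object.
myediteddata <- ts(c(340, 352, 361, 349, 358), start = c(2010, 1), frequency = 4)

# Save to a temporary file (a real package would use its data/ directory).
rda_file <- file.path(tempdir(), "myediteddata.rda")
save(myediteddata, file = rda_file)

# Load into a separate environment and compare with the original object.
restored <- new.env()
load(rda_file, envir = restored)
identical(restored$myediteddata, myediteddata)
```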

To make all the datasets in the data directory available without calling data(), you should add

LazyData: true

to the DESCRIPTION file of your package.

If this doesn't help, you could post one of your files and a printout of the format you want; this will help us to help you ;)

Upvotes: 10
