Reputation: 1756
In SAS there is a way to create a library (using LIBNAME). This is helpful for long data-processing jobs because we don't have to keep changing dataset names: if we want to reuse a dataset without renaming it, we can put it in a library. Even when two datasets have the same name, we can work with both at once because they sit in different libraries.
My question: is there any such option in R to create a library (or a separate folder within R) where we can save our data?
Here's the example:
Suppose I have a dataset "dat1". I summarize the variables var1 and var2 in dat1 by var3.
proc summary data=dat1 nway missing;
var var1 var2;
class var3;
output out=tmp.dat1 (drop = _freq_ _type_) sum = ;
run;
Then I merge dat1 with another dataset, dat2. Both dat1 and dat2 have the common variable var3, on which I merge. I create a new dataset named dat1 again.
proc sql;
create table dat1 as
select a.*,b.*
from dat1 a left join tmp.dat2 b
on a.var3=b.var3;
quit;
Now I summarize dataset dat1 again after the merge, to check whether the values of var1 and var2 are the same before and after merging.
proc summary data=dat1 nway missing;
var var1 var2;
class var3;
output out=tmp1.dat1 (drop = _freq_ _type_) sum = ;
run;
The equivalent code in R would be
dat3 <- ddply(dat1,
.(var3),
summarise,
var1 = sum(var1,na.rm=TRUE),
var2 = sum(var2,na.rm=TRUE))
dat1 <- sqldf("select a.*,b.*
from dat1 a
left join dat2 b
on a.var3=b.var3")
dat4 <- ddply(dat1,
.(var3),
summarise,
var1 = sum(var1,na.rm=TRUE),
var2 = sum(var2,na.rm=TRUE))
In SAS I used just two dataset names, but in R I'm using four. So if I'm writing 4,000 lines of data-processing code, keeping track of so many dataset names becomes overwhelming. In SAS it was easy to reuse the same dataset name because I used two libraries, tmp and tmp1, in addition to the default work library.
In SAS, library is defined as:
LIBNAME tmp "directory_path\folder_name";
In this folder, dat1 will be stored.
Upvotes: 3
Views: 3083
Reputation: 115485
Here is an example using the SOAR package and named environments.
To quote from the vignette:
Objects need not be always held in memory. The function save may be used to save objects on the disc in a file, typically with an .RData extension. The objects may then be removed from memory and later recalled explicitly with the load function.
The SOAR package provides a simple way to store objects on the disc, but in such a way that they remain visible on the search path as promises, that is, if and when an object is needed again it is automatically loaded into memory. It uses the same lazy loading mechanism as packages, but the functionality provided here is more dynamic and flexible.
It will be useful to read the whole vignette.
library(SOAR)
library(plyr)
library(sqldf)
set.seed(1)
# create some dummy data and a named environment
tmp <- new.env(parent = .GlobalEnv)
dat1 <- data.frame(var1 = rnorm(50),
var2 = sample(50, replace = TRUE),
var3 = sample(letters[1:5], 50, replace = TRUE))
tmp$dat1 <- ddply(dat1, .(var3), summarise,
var1 = sum(var1, na.rm = TRUE),
var2 = sum(var2, na.rm = TRUE))
tmp$dat2 <- data.frame(Var3 = sample(letters[1:5], 20, replace = TRUE),
Var4 = 1:20)
# store as a SOAR cached object (on disc)
Store(tmp, lib = "tmp")
# replace dat1 within the global environment using sqldf; create a new
# environment to work in with the correct versions of dat1 and dat2
sqlenv <- tmp
sqlenv$dat1 <- dat1
dat1 <- sqldf("select a.*,b.* from dat1 a left join dat2 b on a.var3=b.var3",
envir = sqlenv)
# create a new named environment tmp1
tmp1 <- new.env(parent = .GlobalEnv)
tmp1$dat1 <- ddply(dat1, .(var3), summarise,
var1 = sum(var1, na.rm = TRUE),
var2 = sum(var2, na.rm = TRUE))
# store using a SOAR cache
Store(tmp1, lib = "tmp")
tmp1$dat1
## var3 var1 var2
## 1 a 1.336 378
## 2 b 8.514 1974
## 3 c 5.795 624
## 4 d -8.828 936
## 5 e 20.846 1490
tmp$dat1
## var3 var1 var2
## 1 a 0.4454 126
## 2 b 1.4190 329
## 3 c 1.9316 208
## 4 d -2.9427 312
## 5 e 4.1691 298
I'm not sure you should expect tmp1$dat1 and tmp$dat1 to be identical (given my example anyway).
Upvotes: 4
Reputation: 8272
Named environments are one of a number of ways of achieving what it sounds like you want.
Personally, if there aren't a lot of different data frames or lists, I'd lean toward organizing it other ways, such as inside either data frames or lists, depending on how your data is structured. But if each thing consists of many different kinds of data and functions, environments may be significantly better. They're described in the help, and a number of posts to r-blogs discuss them.
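A minimal sketch of the list idea (the names `step1` and `step2` are hypothetical):

```r
# Hypothetical sketch: identically named data frames kept apart by
# storing them in separate named lists instead of separate libraries
step1 <- list(dat1 = data.frame(x = 1:3),
              dat2 = data.frame(x = 4:6))
step2 <- list(dat1 = data.frame(x = 7:9))  # same name "dat1", different container

step1$dat1$x
step2$dat1$x
```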
But on reflection, R-Studio projects may be closer to the way you're thinking about the problem (and if you're not using R-Studio already, I highly recommend it). Have a look at how projects work.
Upvotes: 2
Reputation: 58855
There are two separate aspects of SAS's libraries which (it seems) you are interested in: storing datasets on disk in a named location, and keeping identically named datasets distinct so that both can be used at once.
Taking these in that order.
The problem with answering the first is that R and SAS have different models for how data is stored. R stores data in memory, organized in environments arranged in a particular search order. SAS stores data on disk, and the names of datasets correspond to file names within a specified directory (there is likely caching in memory for optimization, but conceptually this is how data is stored). R can store (sets of) objects in a file on disk using save() and bring them back into memory using load(). The filename and directory can be specified in those function calls (hence Paul's answer). You could have several .RData files, each containing objects named dat1, dat2, etc., which can be loaded prior to running an analysis, and the results can be written out to (other) .RData files.
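A minimal sketch of that save()/load() pattern, with hypothetical directory names standing in for SAS libraries:

```r
# Hypothetical sketch: one directory per "library", one .RData file per dataset
dir.create("tmp",  showWarnings = FALSE)
dir.create("tmp1", showWarnings = FALSE)

dat1 <- data.frame(var3 = c("a", "b"), var1 = c(1, 2))
save(dat1, file = file.path("tmp", "dat1.RData"))       # like tmp.dat1 in SAS

dat1 <- data.frame(var3 = c("a", "b"), var1 = c(9, 9))  # reuse the same name
save(dat1, file = file.path("tmp1", "dat1.RData"))      # like tmp1.dat1

# recall the first version; load() restores it into the current environment
load(file.path("tmp", "dat1.RData"))
```

After the final load(), dat1 in memory is the version that was saved into "tmp".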
An alternative to this would be using one of the extensions which give data types which are backed by disk storage instead of memory. I've not had experience with any of them to talk about how well they would work in this situation, but that is an option. [Edit: mnel's answer has a detailed example of just this idea.]
Your second part can be approached in different ways. Since R uses in-memory data, the answers focus on arranging different environments (each of which can contain different but identically named data sets) and controlling which one gets accessed by attach()ing and detach()ing the environments from the search path (which is what Glen_b's answer gets toward). You still don't have the disk backing of the data, but that is the previous problem.
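A minimal sketch of that attach()/detach() idea (the environment names are hypothetical); note that attach() works on a copy of the environment's contents, so this pattern suits reading rather than modifying:

```r
# Hypothetical sketch: identically named datasets in two environments,
# selected via the search path
tmp  <- new.env()
tmp1 <- new.env()
tmp$dat1  <- data.frame(var1 = 1:3)
tmp1$dat1 <- data.frame(var1 = 4:6)

attach(tmp, name = "tmp")
sum(dat1$var1)   # resolves dat1 to tmp$dat1 via the search path
detach("tmp")

attach(tmp1, name = "tmp1")
sum(dat1$var1)   # now resolves dat1 to tmp1$dat1
detach("tmp1")
```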
Finally, @joran's admonition is relevant. The solution to the problem of performing a set of tasks on potentially different (but related) sets of data in R is to write a function to do the work. The function has parameters. Within the function, the parameters are referred to by the names given in the argument list. When the function is called, which particular set of data is sent to it is specified by the function call; the names inside and outside the function need not have anything to do with each other. The suggestions about storing the multiple sets of data in a list implicitly approach the problem this way: the function is called for each set of data in the list in turn. Names don't matter, then.
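A minimal sketch of that function-based approach, using base aggregate() so it is self-contained (the data and names here are hypothetical):

```r
# Hypothetical sketch: one function does the summarizing; the dataset
# passed in can have any name outside the function
summarise_by_var3 <- function(dat) {
  aggregate(cbind(var1, var2) ~ var3, data = dat, FUN = sum)
}

d1 <- data.frame(var3 = c("a", "a", "b"), var1 = 1:3, var2 = 4:6)
d2 <- data.frame(var3 = c("a", "b", "b"), var1 = 7:9, var2 = 1:3)

# apply the same function to each dataset in turn; the names don't matter
lapply(list(d1, d2), summarise_by_var3)
```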
Upvotes: 5
Reputation: 60984
From what I can gather from the SAS online help, a SAS library is a set of datasets stored in a folder that can be referenced as a unit. The equivalent in R would be to store the R objects you want to keep using save:
save(obj1, obj2, etc, file = "stored_objects.rda")
Loading the objects can be done using load.
edit: I don't really see why having an additional object or two is so much of a problem. However, if you want to reduce the number of objects, just put your results in a list.
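A minimal sketch of that list suggestion (the object and file names are hypothetical):

```r
# Hypothetical sketch: one list holds the before/after versions, and
# one save() call persists the whole set, much like a library
results <- list(
  before = data.frame(var3 = c("a", "b"), var1 = c(3, 5)),
  after  = data.frame(var3 = c("a", "b"), var1 = c(3, 5))
)
save(results, file = file.path(tempdir(), "results.rda"))
```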
Upvotes: 6