Ximinez

Reputation: 23

Making multiple named data frames with loop

I'm still learning. I didn't ask my first question well, so I'm trying again and doing my best to be clearer.

I'm trying to create a series of data frames for a reproducible question for my larger issue. I would like to make 4 data frames, each named differently by the year. Eventually I will merge these four data frames to explain where I am encountering my issue.

Here is my most recent attempt. It runs, but it creates a list of four data frames rather than four separately named data frames in the global environment.

 datafrom <- list()
 years <- c(2006,2008,2010,2012)

 for (i in seq_along(years)) {
  UniqueID <- as.character(1:10) # not all IDs are numeric in the real data, so keep as a character vector
  Name <- LETTERS[1:10]
  Entity_Type <- factor(c("This","That")) # two levels, recycled across the 10 rows
  Data1 <- rnorm(10)     
  Data2 <- rnorm(10) 
  Data3 <- rnorm(10) 
  Data4 <- rnorm(10) 
  Year <- years[i]
  datafrom[[i]] <- data.frame(UniqueID, Name, Entity_Type, Data1, Data2, Data3, Data4, Year)
 }

I would like 4 separate data frames, each named datafrom2006, datafrom2008, etc.

Many thanks in advance for your patience with my learning.

Upvotes: 0

Views: 389

Answers (1)

r2evans

Reputation: 160447

I'll demonstrate a few (of many) techniques here, and I'll call them (1) brute force, (2) list-based, and (3) single long-form data.frame.

I'll add to the example a function that you want to apply to each data.frame. Though contrived, it helps make the point:

## some constants used throughout
years <- c(2006, 2008, 2010, 2012)
n <- 10
myfunc <- function(x) {
    interestingPart <- x[ , grepl('^Data', colnames(x)) ]
    sapply(interestingPart, mean)
}

Brute Force

Yes, you can create multiple like-named and same-structure data.frames from a loop, though it is typically frowned upon by many experienced (R?) programmers:

set.seed(42)
for (yr in years) {
    tmpdf <- data.frame(UniqueID=as.character(1:n),
                        Name=LETTERS[1:n],
                        Entity_Type=factor(c('this', 'that')),
                        Data1=rnorm(n),
                        Data2=rnorm(n),
                        Data3=rnorm(n),
                        Data4=rnorm(n),
                        Year=yr)
    assign(sprintf('datafrom%s', yr), tmpdf)
}
rm(yr, tmpdf)

ls()
## [1] "datafrom2006" "datafrom2008" "datafrom2010" "datafrom2012" "myfunc"      
## [6] "n"            "years"       

head(datafrom2006, n=2)
##   UniqueID Name Entity_Type      Data1      Data2      Data3      Data4 Year
## 1        1    A        this  1.3709584  1.3048697 -0.3066386  0.4554501 2006
## 2        2    B        that -0.5646982  2.2866454 -1.7813084  0.7048373 2006

In order to see the results for each data.frame, one would typically (though not always) do something like this:

myfunc(datafrom2006)
##      Data1      Data2      Data3      Data4 
##  0.5472968 -0.1634567 -0.1780795 -0.3639041 
myfunc(datafrom2008)
##       Data1       Data2       Data3       Data4 
## -0.02021535  0.01839391  0.53907680 -0.21787537 
myfunc(datafrom2010)
##       Data1       Data2       Data3       Data4 
##  0.25110630 -0.08719458  0.22924781 -0.19857243 
myfunc(datafrom2012)
##      Data1      Data2      Data3      Data4 
## -0.7949660  0.2102418 -0.2022066 -0.2458678 
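The four separate calls above can be collapsed only by looking the objects up by name. A minimal sketch, assuming smaller stand-in data frames than the ones built above; `dfnames`, `dflist`, and `results` are names introduced here purely for illustration:

```r
## stand-ins for the constants and function defined earlier
years <- c(2006, 2008, 2010, 2012)
n <- 10
myfunc <- function(x) {
    interestingPart <- x[ , grepl('^Data', colnames(x)) ]
    sapply(interestingPart, mean)
}

## brute-force creation, with fewer columns than the original example
set.seed(42)
for (yr in years) {
    assign(sprintf('datafrom%s', yr),
           data.frame(Data1=rnorm(n), Data2=rnorm(n), Year=yr))
}

## rebuild the variable names, fetch the objects by name, then apply myfunc
dfnames <- sprintf('datafrom%s', years)
dflist <- mget(dfnames)
results <- lapply(dflist, myfunc)
str(results)
```

Note that `mget` ends up producing a named list anyway, which is the structure the next section starts from.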

List-Based

set.seed(42)
datafrom <- sapply(as.character(years), function(yr) {
                       data.frame(UniqueID=as.character(1:n),
                                  Name=LETTERS[1:n],
                                  Entity_Type=factor(c('this', 'that')),
                                  Data1=rnorm(n),
                                  Data2=rnorm(n),
                                  Data3=rnorm(n),
                                  Data4=rnorm(n),
                                  Year=yr)
                   }, simplify=FALSE)
str(datafrom)
## List of 4
##  $ 2006:'data.frame':    10 obs. of  8 variables:
##   ..$ UniqueID   : Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
##   ..$ Name       : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
##   ..$ Entity_Type: Factor w/ 2 levels "that","this": 2 1 2 1 2 1 2 1 2 1
##   ..$ Data1      : num [1:10] 1.371 -0.565 0.363 0.633 0.404 ...
##   ..$ Data2      : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
##   ..$ Data3      : num [1:10] -0.307 -1.781 -0.172 1.215 1.895 ...
##   ..$ Data4      : num [1:10] 0.455 0.705 1.035 -0.609 0.505 ...
##   ..$ Year       : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1
##  $ 2008:'data.frame':    10 obs. of  8 variables:
##   ..$ UniqueID   : Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
#### ...snip...

head(datafrom[[1]], n=2)
##   UniqueID Name Entity_Type      Data1      Data2      Data3      Data4 Year
## 1        1    A        this  1.3709584  1.3048697 -0.3066386  0.4554501 2006
## 2        2    B        that -0.5646982  2.2866454 -1.7813084  0.7048373 2006

head(datafrom[['2008']], n=2)
##   UniqueID Name Entity_Type      Data1       Data2      Data3       Data4 Year
## 1        1    A        this  0.2059986  0.32192527 -0.3672346 -1.04311894 2008
## 2        2    B        that -0.3610573 -0.78383894  0.1852306 -0.09018639 2008

With the list, you can test your function on just one element:

myfunc(datafrom[[1]])
myfunc(datafrom[['2010']])

and then run the function on all of them very simply:

lapply(datafrom, myfunc)
## $`2006`
##      Data1      Data2      Data3      Data4 
##  0.5472968 -0.1634567 -0.1780795 -0.3639041 
## $`2008`
##       Data1       Data2       Data3       Data4 
## -0.02021535  0.01839391  0.53907680 -0.21787537 
## $`2010`
##       Data1       Data2       Data3       Data4 
##  0.25110630 -0.08719458  0.22924781 -0.19857243 
## $`2012`
##      Data1      Data2      Data3      Data4 
## -0.7949660  0.2102418 -0.2022066 -0.2458678 
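If separately named objects are still wanted after building the list (as the question asks), one possible route is base R's `list2env`. This is only a sketch with smaller stand-in data frames; `named` is a name introduced here for illustration, and the list names are prefixed first because a bare `2006` is not a syntactic variable name:

```r
years <- c(2006, 2008, 2010, 2012)
n <- 10
set.seed(42)
datafrom <- sapply(as.character(years), function(yr) {
    data.frame(Data1=rnorm(n), Data2=rnorm(n), Year=yr)
}, simplify=FALSE)

## prefix the names: "2006" becomes "datafrom2006", and so on
named <- setNames(datafrom, paste0('datafrom', names(datafrom)))

## copy each list element into its own variable in the current environment
invisible(list2env(named, envir=environment()))
ls(pattern='^datafrom[0-9]+$')
```

Keeping the list around as well means `lapply` remains available for anything that has to run over every year.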

Long-form Data

If instead you keep all of the data in the same data.frame, using your already-defined column of Year, you can still segment it for exploring individual years:

longdf <- do.call('rbind.data.frame', datafrom)
rownames(longdf) <- NULL
longdf[c(1,11,21,31),]
##    UniqueID Name Entity_Type      Data1     Data2      Data3       Data4 Year
## 1         1    A        this  1.3709584 1.3048697 -0.3066386  0.45545012 2006
## 11        1    A        this  0.2059986 0.3219253 -0.3672346 -1.04311894 2008
## 21        1    A        this  1.5127070 1.3921164  1.2009654 -0.02509255 2010
## 31        1    A        this -1.4936251 0.5676206 -0.0861073 -0.04069848 2012

Simple subsets:

  • subset(longdf, Year == 2006), though subset has its pros and cons (its documentation recommends it mainly for interactive use).
  • by(longdf, longdf$Year, myfunc)
  • If using library(dplyr), try longdf %>% filter(Year == 2010) %>% myfunc()
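Per-year summaries can also be computed on the long form in a single call. A minimal sketch using base R's `aggregate`, assuming a smaller stand-in for `longdf` with only two data columns; `yearmeans` is a name introduced here for illustration:

```r
years <- c(2006, 2008, 2010, 2012)
n <- 10
set.seed(42)
## stand-in long-form data: n rows per year, stacked into one data.frame
longdf <- do.call(rbind, lapply(years, function(yr) {
    data.frame(Data1=rnorm(n), Data2=rnorm(n), Year=yr)
}))

## one row per Year, one column of means per data column
yearmeans <- aggregate(cbind(Data1, Data2) ~ Year, data=longdf, FUN=mean)
yearmeans
```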

(Side note: when trying to plot aggregate data, it's often easier when the data is in this form, especially when using ggplot2-like layering and aesthetics.)

Rationale Against "Brute Force"

In answer to your comment question, when making different variables with the same structure, it is easy to deduce that you will be doing the same thing to each of them, in turn or immediately-consecutively. As a general programming principle, many try to generalize what they do so that if it can be done once, it can be done an arbitrary number of times without (heavily) adjusting the code. For instance, compare what was necessary in applying myfunc in the two examples above.

Further, if you later want to aggregate the results from your calls to myfunc, it is more laborious in the "brute force" example (as you must capture each return and combine manually), whereas the other two techniques can use simpler summarizing functions (e.g., another lapply, or perhaps Reduce or Filter).
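As a rough illustration of that last point, the per-year results from the list-based approach can be combined in one or two lines. This sketch assumes smaller stand-in data frames; `results`, `summarymat`, and `overall` are names introduced here for illustration:

```r
years <- c(2006, 2008, 2010, 2012)
n <- 10
myfunc <- function(x) {
    interestingPart <- x[ , grepl('^Data', colnames(x)) ]
    sapply(interestingPart, mean)
}
set.seed(42)
datafrom <- sapply(as.character(years), function(yr) {
    data.frame(Data1=rnorm(n), Data2=rnorm(n), Year=yr)
}, simplify=FALSE)

results <- lapply(datafrom, myfunc)

## stack the per-year named vectors into a matrix: one row per year
summarymat <- do.call(rbind, results)
summarymat

## Reduce adds the vectors element-wise; dividing by the count gives
## the mean of the yearly means for each data column
overall <- Reduce(`+`, results) / length(results)
overall
```

With the brute-force variables, each of the four return values would first have to be captured and combined by hand.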

Upvotes: 1
