Reputation: 23
In the process of learning. Didn't ask my first question well, so I'm trying again and doing my best to be more clear.
I'm trying to create a series of data frames for a reproducible question for my larger issue. I would like to make 4 data frames, each named differently by the year. Eventually I will merge these four data frames to explain where I am encountering my issue.
Here is my most recent attempt. It runs, but it creates a list of four data frames rather than four separate data frames in the global environment.
datafrom <- list()
years <- c(2006,2008,2010,2012)
for (i in seq_along(years)) {
  UniqueID <- 1:10 # <- Not all numeric - Kept as character vector
  Name <- LETTERS[seq( from = 1, to = 10 )]
  # factor("This","That") would treat "That" as the only level and yield NA,
  # so pass both values as one vector (recycled to 10 rows by data.frame)
  Entity_Type <- factor(c("This","That"))
  Data1 <- rnorm(10)
  Data2 <- rnorm(10)
  Data3 <- rnorm(10)
  Data4 <- rnorm(10)
  Year <- years[i]
  datafrom[[i]] <- data.frame(UniqueID, Name, Entity_Type, Data1, Data2, Data3, Data4, Year)
}
I would like 4 separate data frames, each named datafrom2006, datafrom2008, etc.
Many thanks in advance for your patience with my learning.
Upvotes: 0
Views: 389
Reputation: 160447
I'll demonstrate a few (of many) techniques here, and I'll call them (1) brute force, (2) list-based, and (3) single long-form data.frame.
To the example I'll add a function that you want to apply to each data.frame. Though contrived, it helps make the point:
## some constants used throughout
years <- c(2006, 2008, 2010, 2012)
n <- 10
myfunc <- function(x) {
  interestingPart <- x[ , grepl('^Data', colnames(x)) ]
  sapply(interestingPart, mean)
}
Brute force: yes, you can create multiple like-named, same-structure data.frames in a loop, though many experienced (R?) programmers frown upon it:
set.seed(42)
for (yr in years) {
  tmpdf <- data.frame(UniqueID=as.character(1:n),
                      Name=LETTERS[1:n],
                      Entity_Type=factor(c('this', 'that')),
                      Data1=rnorm(n),
                      Data2=rnorm(n),
                      Data3=rnorm(n),
                      Data4=rnorm(n),
                      Year=yr)
  assign(sprintf('datafrom%s', yr), tmpdf)
}
rm(yr, tmpdf)
ls()
## [1] "datafrom2006" "datafrom2008" "datafrom2010" "datafrom2012" "myfunc"
## [6] "n" "years"
head(datafrom2006, n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.4554501 2006
## 2 2 B that -0.5646982 2.2866454 -1.7813084 0.7048373 2006
In order to see the results for each data.frame, one would typically (though not always) do something like this:
myfunc(datafrom2006)
## Data1 Data2 Data3 Data4
## 0.5472968 -0.1634567 -0.1780795 -0.3639041
myfunc(datafrom2008)
## Data1 Data2 Data3 Data4
## -0.02021535 0.01839391 0.53907680 -0.21787537
myfunc(datafrom2010)
## Data1 Data2 Data3 Data4
## 0.25110630 -0.08719458 0.22924781 -0.19857243
myfunc(datafrom2012)
## Data1 Data2 Data3 Data4
## -0.7949660 0.2102418 -0.2022066 -0.2458678
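If the separately named data.frames already exist, the repeated calls above can be collapsed by gathering the objects into a list first. What follows is only a minimal sketch: it assumes objects named datafrom<year> exist in the current environment, it uses small made-up stand-in data, and the names `collected` and `results` are hypothetical.

```r
## stand-in data (hypothetical); in practice these objects would already exist
years <- c(2006, 2008, 2010, 2012)
set.seed(1)
for (yr in years) {
  assign(sprintf("datafrom%s", yr),
         data.frame(Data1 = rnorm(5), Data2 = rnorm(5), Year = yr))
}
rm(yr)

## same idea as myfunc in the answer: mean of every column starting with "Data"
myfunc <- function(x) {
  interestingPart <- x[, grepl("^Data", colnames(x)), drop = FALSE]
  sapply(interestingPart, mean)
}

## gather the like-named objects into one named list
collected <- mget(sprintf("datafrom%s", years))
names(collected)

## one call instead of four
results <- lapply(collected, myfunc)
```

This does not remove the separate variables; it only makes them easier to process together.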
List-based: store the data.frames in a single named list instead of in separate variables:
set.seed(42)
datafrom <- sapply(as.character(years), function(yr) {
  data.frame(UniqueID=as.character(1:n),
             Name=LETTERS[1:n],
             Entity_Type=factor(c('this', 'that')),
             Data1=rnorm(n),
             Data2=rnorm(n),
             Data3=rnorm(n),
             Data4=rnorm(n),
             Year=yr)
}, simplify=FALSE)
str(datafrom)
## List of 4
## $ 2006:'data.frame': 10 obs. of 8 variables:
## ..$ UniqueID : Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
## ..$ Name : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
## ..$ Entity_Type: Factor w/ 2 levels "that","this": 2 1 2 1 2 1 2 1 2 1
## ..$ Data1 : num [1:10] 1.371 -0.565 0.363 0.633 0.404 ...
## ..$ Data2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
## ..$ Data3 : num [1:10] -0.307 -1.781 -0.172 1.215 1.895 ...
## ..$ Data4 : num [1:10] 0.455 0.705 1.035 -0.609 0.505 ...
## ..$ Year : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1
## $ 2008:'data.frame': 10 obs. of 8 variables:
## ..$ UniqueID : Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
#### ...snip...
head(datafrom[[1]], n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.4554501 2006
## 2 2 B that -0.5646982 2.2866454 -1.7813084 0.7048373 2006
head(datafrom[['2008']], n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 0.2059986 0.32192527 -0.3672346 -1.04311894 2008
## 2 2 B that -0.3610573 -0.78383894 0.1852306 -0.09018639 2008
With this approach you can still test your function on just one data.frame:
myfunc(datafrom[[1]])
myfunc(datafrom[['2010']])
and then run the function on all of them very simply:
lapply(datafrom, myfunc)
## $`2006`
## Data1 Data2 Data3 Data4
## 0.5472968 -0.1634567 -0.1780795 -0.3639041
## $`2008`
## Data1 Data2 Data3 Data4
## -0.02021535 0.01839391 0.53907680 -0.21787537
## $`2010`
## Data1 Data2 Data3 Data4
## 0.25110630 -0.08719458 0.22924781 -0.19857243
## $`2012`
## Data1 Data2 Data3 Data4
## -0.7949660 0.2102418 -0.2022066 -0.2458678
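If individually named objects are still wanted (as the question asks), a named list can be exported in one step instead of calling assign in a loop. This is a hedged sketch only: `dflist` and its contents are hypothetical stand-ins, and it assumes the global environment is where the objects should go.

```r
years <- c(2006, 2008, 2010, 2012)
set.seed(1)

## stand-in list of data.frames (hypothetical contents), named datafrom<year>
dflist <- lapply(years, function(yr) data.frame(Data1 = rnorm(3), Year = yr))
names(dflist) <- sprintf("datafrom%s", years)

## copy each list element into the global environment under its name
invisible(list2env(dflist, envir = globalenv()))

exists("datafrom2008")
```

Keeping the list as the primary object and exporting only when needed preserves the ability to use lapply on all of the data.frames at once.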
Long-form data.frame: if instead you keep all of the data in a single data.frame, using your already-defined Year column, you can still segment it to explore individual years:
longdf <- do.call('rbind.data.frame', datafrom)
rownames(longdf) <- NULL
longdf[c(1,11,21,31),]
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.45545012 2006
## 11 1 A this 0.2059986 0.3219253 -0.3672346 -1.04311894 2008
## 21 1 A this 1.5127070 1.3921164 1.2009654 -0.02509255 2010
## 31 1 A this -1.4936251 0.5676206 -0.0861073 -0.04069848 2012
Simple subsets: subset(longdf, Year == 2006), though subset has its advantages and drawbacks.
Applying a function per year: by(longdf, longdf$Year, myfunc)
If you use library(dplyr), try longdf %>% filter(Year == 2010) %>% myfunc()
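Two further base-R options for working with long-form data are split, which recovers a per-year list, and aggregate, which computes per-year summaries directly. The sketch below assumes a small made-up `longdf` with only two Data columns; `byyear` and `yearmeans` are hypothetical names.

```r
years <- c(2006, 2008, 2010, 2012)
n <- 5
set.seed(1)

## stand-in long-form data (hypothetical values)
longdf <- data.frame(Year  = rep(years, each = n),
                     Data1 = rnorm(n * length(years)),
                     Data2 = rnorm(n * length(years)))

## back to a list of per-year data.frames, named "2006", "2008", ...
byyear <- split(longdf, longdf$Year)
names(byyear)

## per-year means of the Data columns in one call
yearmeans <- aggregate(cbind(Data1, Data2) ~ Year, data = longdf, FUN = mean)
yearmeans
```

Because split returns a named list, everything shown for the list-based technique (for example lapply) applies to its result as well.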
(Side note: when trying to plot aggregate data, it's often easier when the data is in this form, especially when using ggplot2-like layering and aesthetics.)
In answer to the question in your comment: when you create several variables with the same structure, it is a safe bet that you will do the same thing to each of them, either in turn or in immediate succession. As a general programming principle, many try to generalize what they do so that if it can be done once, it can be done an arbitrary number of times without (heavily) adjusting the code. For instance, compare what was necessary to apply myfunc in the two examples above.
Further, if you later want to aggregate the results from your calls to myfunc, it is more laborious in the "brute force" example (as you must capture each return and combine manually), whereas the other two techniques can use simpler summarizing functions (e.g., another lapply, or perhaps Reduce or Filter).
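To make the aggregation point concrete, here is a minimal sketch of combining per-year results from a list. It uses made-up stand-in data with two Data columns; `summarymat` and `positive` are hypothetical names, and the "positive mean" condition is purely illustrative.

```r
years <- c(2006, 2008, 2010, 2012)
set.seed(1)

## stand-in list of data.frames keyed by year (hypothetical values)
datafrom <- sapply(as.character(years), function(yr) {
  data.frame(Data1 = rnorm(5), Data2 = rnorm(5), Year = yr)
}, simplify = FALSE)

## same idea as myfunc in the answer: mean of every column starting with "Data"
myfunc <- function(x) {
  interestingPart <- x[, grepl("^Data", colnames(x)), drop = FALSE]
  sapply(interestingPart, mean)
}

## one row per year, one column per Data variable
summarymat <- do.call(rbind, lapply(datafrom, myfunc))
summarymat

## Filter keeps only the years whose result meets a condition
positive <- Filter(function(res) res["Data1"] > 0, lapply(datafrom, myfunc))
names(positive)
```

With the brute-force variables, the same matrix would require naming every data.frame explicitly in the rbind call.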
Upvotes: 1