Reputation: 53
I want to loop a function over data frames and add the function's output to each data frame. I have read many relevant posts on looping and applying functions to data frames, which got me close to what I need but not quite there, so I'm hoping people can help me.
I have a number of dataframes that look like this
dat1=as.data.frame(matrix(rnorm(25),ncol=5))
dat2=as.data.frame(matrix(rnorm(25),ncol=5))
dat3=as.data.frame(matrix(rnorm(25),ncol=5))
I want to calculate the mean of some of these columns and add on to the end of the dataframe. So, I wrote a function to calculate means from a dataframe:
my_fun <- function(dataframe){
rowMeans( dataframe[ , c("V1","V2")],na.rm=TRUE)
}
To apply this function to one dataframe, I do this:
dat1$V6<- my_fun(dat1)
But I want to be able to loop through all dataframes and add this mean column on to the end of each df.
After reading some helpful posts, I created a list and used sapply:
dfList<-list(dat1,dat2,dat3)#create list
sapply(dfList, my_fun) #apply function to list
This gives me the values that I want, but I don't want them in a separate output. I'd like them added as a column on the original data frame, as happens when I apply the function to an individual data frame. Can anyone tell me how to do this, or point me to a post that describes it? (I have searched high and low, but maybe I'm typing the wrong keywords.) I'm sure it's very straightforward if you know how!
Upvotes: 5
Views: 24058
Reputation: 92282
You could use Map to cbind each element of dfList with the result of calling your function through lapply (instead of sapply):
temp <- Map(cbind, dfList, V6 = lapply(dfList, my_fun)) # Add V6 column to `dfList`
datasnames <- c("dat1","dat2","dat3") # write here the names of your data sets (in the same order they were specified in `dfList`)
names(temp) <- datasnames
list2env(temp, envir = .GlobalEnv)
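If the data frames do not have to live as separate objects, the same idea can be kept in a named list. The following is only a minimal sketch: it repeats the dat1 to dat3 and my_fun definitions from the question so that it runs on its own, and df_names and dfs are names made up for this illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
dat1 <- as.data.frame(matrix(rnorm(25), ncol = 5))
dat2 <- as.data.frame(matrix(rnorm(25), ncol = 5))
dat3 <- as.data.frame(matrix(rnorm(25), ncol = 5))

my_fun <- function(dataframe) {
  rowMeans(dataframe[, c("V1", "V2")], na.rm = TRUE)
}

# mget() collects the objects into a list named after them,
# so the names do not have to be attached in a separate step.
df_names <- c("dat1", "dat2", "dat3")  # hypothetical helper name
dfs <- mget(df_names)

# Add V6 to every data frame; lapply() keeps the list names.
dfs <- lapply(dfs, function(d) {
  d$V6 <- my_fun(d)
  d
})

# Optional: uncomment to overwrite the original objects, as in the answer.
# list2env(dfs, envir = .GlobalEnv)
```

Working on dfs directly (for example dfs$dat1) avoids writing into the global environment at all, which tends to be easier to reason about when there are many data frames.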
Edit (per the OP's additional request in the comments)
There are many ways to do this; one is to modify your function so that it returns both columns:
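my_fun <- function(dataframe){
V6 <- rowMeans( dataframe[ , c("V1","V2")],na.rm=TRUE) # row-wise mean of V1 and V2
V7 <- sd(dataframe[, "V1"],na.rm=TRUE) # single value: sd of the whole V1 column
return(cbind(V6, V7)) # V7 is recycled to every row
}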
temp <- Map(cbind, dfList, lapply(dfList, my_fun)) # Add V6 and V7 columns to `dfList`
names(temp) <- datasnames
list2env(temp, envir = .GlobalEnv)
Upvotes: 1
Reputation: 887048
Try this:
Just to clear the confusion.
dat1=as.data.frame(matrix(rnorm(25),ncol=5))
dat5=as.data.frame(matrix(rnorm(25),ncol=5))
dat7=as.data.frame(matrix(rnorm(25),ncol=5))
my_fun <- function(dataframe){
rowMeans( dataframe[ , c("V1","V2")],na.rm=TRUE)
}
dfList<-list(dat1,dat5,dat7)
Vars <- grep("^dat[0-9]+$", ls(), value=TRUE) # anchored so other objects containing "dat" are not matched
Vars
#[1] "dat1" "dat5" "dat7"
res <- lapply(dfList, function(x) transform(x,V6=my_fun(x)))
for(i in seq_along(Vars)){
assign(Vars[i], res[[i]],envir=.GlobalEnv)
}
A second function, applied the same way:
my_funSD <- function(dataframe){
apply( dataframe[ , c("V1","V2")],1,sd)
}
dfList<-list(dat1,dat5,dat7)
res1 <- lapply(dfList, function(x) transform(x,V7=my_funSD(x)))
ncol(dat1)
#[1] 6
for(i in seq_along(Vars)){
assign(Vars[i], res1[[i]],envir=.GlobalEnv)
}
ncol(dat1)
#[1] 7
colnames(dat1)
#[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7"
Upvotes: 1
Reputation: 1272
It's much easier, and can be much faster on large data, if you use data.table, because := adds the column by reference:
require(data.table)
set.seed(612)
dat1=as.data.table(matrix(rnorm(25),ncol=5))
dat2=as.data.table(matrix(rnorm(25),ncol=5))
dat3=as.data.table(matrix(rnorm(25),ncol=5))
dtList <- list(dat1, dat2, dat3)
for (dat in dtList) {
dat[,V6:=(V1 + V2)/2]
}
This gives the following output:
> dtList
[[1]]
V1 V2 V3 V4 V5 V6
1: 0.3903228 -1.1581608 1.0171311 0.3866628 0.02756137 -0.38391897
2: -0.6030124 0.4713771 -2.4204376 -0.2843527 0.53463600 -0.06581764
3: -0.9850333 0.3343518 -1.2329712 -1.1767533 0.56714483 -0.32534080
4: -0.1591335 -0.6729444 0.5062648 -0.3001857 -0.84896068 -0.41603897
5: 1.7127203 0.3149884 1.7633945 1.7824786 -0.90316850 1.01385434
[[2]]
V1 V2 V3 V4 V5 V6
1: -1.22790810 0.8429506 0.4921844 -0.29686607 -0.9501956 -0.1924788
2: 0.09405923 -1.6970403 0.1280003 1.22284944 0.8667643 -0.8014905
3: 0.55298783 -0.1081849 0.4120268 -0.56411756 1.9135802 0.2224015
4: -0.82621808 0.4753731 0.4755664 -0.05885804 0.9658787 -0.1754225
5: 0.44262554 0.3036363 -1.7404580 0.88870595 1.4826431 0.3731309
[[3]]
V1 V2 V3 V4 V5 V6
1: 0.82085834 0.07221027 1.8835042 0.2563714 0.27891033 0.44653430
2: 0.00445113 1.89450534 0.3878858 1.8385587 -1.86381524 0.94947824
3: 0.66458950 -1.31023362 -0.9403257 1.2128128 0.74922668 -0.32282206
4: -1.40169143 -1.52925147 0.8232823 0.3391147 0.33463875 -1.46547145
5: 1.10566340 -1.16512217 0.3859652 0.8123110 0.04712086 -0.02972939
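Note that (V1 + V2)/2 returns NA for any row where either value is missing, whereas the rowMeans(..., na.rm=TRUE) call in the question ignores missing values. A minimal sketch of a variant that keeps the question's behavior is shown below. It assumes the data.table package is installed, and the NA placed in the first row is only there to illustrate the difference.

```r
library(data.table)

set.seed(612)
dat1 <- as.data.table(matrix(rnorm(25), ncol = 5))
dat2 <- as.data.table(matrix(rnorm(25), ncol = 5))
dat1[1, V1 := NA_real_]  # illustrative missing value, not part of the original data

dtList <- list(dat1, dat2)
for (dat in dtList) {
  # .SD contains only the columns listed in .SDcols;
  # := adds V6 by reference, so no reassignment is needed.
  dat[, V6 := rowMeans(.SD, na.rm = TRUE), .SDcols = c("V1", "V2")]
}
```

In the first row of dtList[[1]], V6 is then equal to V2, because the missing V1 is dropped from the mean.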
Upvotes: 3
Reputation: 99331
You could adjust your function so that it assembles the data frame before returning the result. Then you can use lapply with the other data frames.
> add.column <- function(dataframe){
dataframe$Mean <- rowMeans(dataframe[, c("V1","V2")], na.rm = TRUE)
dataframe
}
> lapply(list(dat1, dat2, dat3), add.column)
[[1]]
V1 V2 V3 V4 V5 Mean
1 0.3353508 0.01120530 1.24966428 -0.11351381 -0.1004569 0.1732780
2 -0.4815519 -0.06247284 -0.02295877 0.87322960 -0.5307487 -0.2720124
3 0.7600038 1.85289546 0.36312147 -0.02831645 -1.6644752 1.3064496
4 0.1679056 1.27509579 0.80371659 -0.85136078 -0.1014557 0.7215007
5 -1.7393888 0.72481368 0.06583188 -0.61422707 1.7497541 -0.5072876
[[2]]
V1 V2 V3 V4 V5 Mean
1 -1.5753298 -0.6997616 -0.4343483 0.04134617 -2.3060171 -1.13754567
2 0.1871898 -0.2132065 -0.1839468 -0.29277828 1.6798379 -0.01300837
3 -0.1530399 -1.0674358 -0.2897129 -0.49723256 -1.0571164 -0.61023786
4 -1.5099567 0.4665531 -0.7433692 1.71985447 -0.3236015 -0.52170179
5 -0.6545995 -0.5638415 -0.5925143 -0.10060696 0.1332452 -0.60922051
[[3]]
V1 V2 V3 V4 V5 Mean
1 0.1716287 -0.09367833 0.12334527 -0.0061906 1.4477591 0.03897519
2 -1.1550710 0.26045021 1.80286581 0.4236090 -0.0848587 -0.44731037
3 0.3735215 -0.70953979 -0.08194104 0.7470116 0.6457781 -0.16800913
4 2.0178667 -0.90537252 1.51190214 -1.0803367 0.7308533 0.55624709
5 0.4143222 0.89980628 -1.87445683 -1.8482057 0.1797795 0.65706422
Also, you might want to add a subset argument in case you want to take the mean of other subsets of columns later. Something like this:
> add.column <- function(data, subset)
{
data$Mean <- rowMeans(data[, subset], na.rm = TRUE)
data
}
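As a usage sketch for the two-argument version: extra arguments given to lapply after the function are passed on to it, so the column subset can be supplied once for all data frames. The data below are regenerated so that the block runs on its own, res is a name made up for this illustration, and drop = FALSE is an addition that is not in the answer above (without it, a single column name collapses to a vector and rowMeans fails).

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
dat1 <- as.data.frame(matrix(rnorm(25), ncol = 5))
dat2 <- as.data.frame(matrix(rnorm(25), ncol = 5))
dat3 <- as.data.frame(matrix(rnorm(25), ncol = 5))

add.column <- function(data, subset) {
  # drop = FALSE keeps a data frame even when subset names one column
  data$Mean <- rowMeans(data[, subset, drop = FALSE], na.rm = TRUE)
  data
}

# subset = ... is forwarded by lapply() to add.column() for every element
res <- lapply(list(dat1 = dat1, dat2 = dat2, dat3 = dat3),
              add.column, subset = c("V3", "V4", "V5"))

res$dat1  # dat1 with the extra Mean column
```

The originals are left untouched here; assign the elements of res back (or keep working with the list) depending on which you prefer.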
Upvotes: 1