Reputation: 23
When I calculate the mean of a numeric column using ddply, the output is not what I expect:
ddply(df, .(df[,1]), summarize, Sales = mean(df[,5]))
The output is:
  df1[, 4]       Sales
1 X01.01.2012 49761.36
2 X01.02.2012 49761.36
3 X01.03.2012 49761.36
4 X01.04.2012 49761.36
5 X01.05.2012 49761.36
6 X01.06.2012 49761.36
I do not understand why the mean is the same for every date, even though the data is grouped by date. This is not the output I expected, since the sales were different on each date. It appears to calculate the mean of the whole column.
Upvotes: 0
Views: 417
Reputation: 99331
The second argument should be .(variable name). df[,1] refers to the values in the column, not the name of the variable. The same applies inside mean(): mean(df[,5]) averages the entire column rather than the values within each group.
Here's a short example with fake data, since you did not supply any.
> df <- data.frame(val1 = 1:5, val2 = 6:10)
> library(plyr)
## correct mean
> ddply(df, .(val1, val2), summarize, mean = mean(c(val1, val2)))
val1 val2 mean
1 1 6 3.5
2 2 7 4.5
3 3 8 5.5
4 4 9 6.5
5 5 10 7.5
## incorrect mean
> ddply(df, .(df[,1], df[,2]), summarize, mean = mean(c(df[,1], df[,2])))
df[, 1] df[, 2] mean
1 1 6 5.5
2 2 7 5.5
3 3 8 5.5
4 4 9 5.5
5 5 10 5.5
If this doesn't resolve your issue, please provide a sample of your data so that we can reproduce your problem.
Upvotes: 2
Reputation: 1508
df is the name of your entire data frame; ddply and summarize don't change the meaning of df. summarize is designed to work with named columns. Do your columns have names? If so, use those, which would look something like:
ddply(df, .(date), summarize, Sales=mean(sales))
One way to handle columns by position is to specify, in place of summarize, a function that operates on each chunk:
ddply(df, .(df[,1]), function(chunk) data.frame(Sales=mean(chunk[,5])))
but I would rather recommend giving your data column names instead:
colnames(df)[c(1,5)] <- c("date", "sales")
ddply(df, .(date), summarize, Sales=mean(sales))
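For anyone who wants to try both approaches, here is a minimal sketch on made-up data. The data frame, its filler columns V2 to V4, and the sales figures are all hypothetical; the only assumption carried over from the question is that the date sits in column 1 and the sales in column 5.

```r
library(plyr)

# Hypothetical data: column 1 holds the date, column 5 holds the sales figure.
df <- data.frame(
  V1 = c("X01.01.2012", "X01.01.2012", "X01.02.2012", "X01.02.2012"),
  V2 = 1:4,
  V3 = 1:4,
  V4 = 1:4,
  V5 = c(10, 20, 30, 50)
)

# Approach 1: keep working by position, but index the per-group chunk,
# not the full data frame.
by_position <- ddply(df, .(df[, 1]),
                     function(chunk) data.frame(Sales = mean(chunk[, 5])))

# Approach 2: give the columns names and let summarize look them up
# inside each group.
colnames(df)[c(1, 5)] <- c("date", "sales")
by_name <- ddply(df, .(date), summarize, Sales = mean(sales))

print(by_name)
```

Both results should contain one row per date, with Sales equal to the mean of that date's rows only (15 and 40 for the sample values above), rather than the overall column mean.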
Upvotes: 1