user3723816

Reputation: 23

Unexpected ddply() output: not grouping

When I calculate the mean of a numeric column using ddply, the output is not what I expect:

ddply(df, .(df[,1]), summarize, Sales = mean(df[,5]))

The output is:

     df1[, 4]    Sales
1 X01.01.2012 49761.36
2 X01.02.2012 49761.36
3 X01.03.2012 49761.36
4 X01.04.2012 49761.36
5 X01.05.2012 49761.36
6 X01.06.2012 49761.36

I do not understand why the mean is the same for every row, even though the data is grouped by date. This is not the output I expected, given that sales were different on each date. It seems to calculate the mean of the whole column.

Upvotes: 0

Views: 417

Answers (2)

Rich Scriven

Reputation: 99331

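The second argument should be .(variable name), using the bare column name. df[,1] refers to the values of that column in the full data frame, not to the name of the variable. The same applies inside mean(): mean(df[,5]) is computed over the whole column for every group, which is why every row shows the same value.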

Here's a short example with fake data, since you did not supply any.

> df <- data.frame(val1 = 1:5, val2 = 6:10)
> library(plyr)
## correct mean
> ddply(df, .(val1, val2), summarize, mean = mean(c(val1, val2)))
  val1 val2 mean
1    1    6  3.5
2    2    7  4.5
3    3    8  5.5
4    4    9  6.5
5    5   10  7.5
## incorrect mean
> ddply(df, .(df[,1], df[,2]), summarize, mean = mean(c(df[,1], df[,2])))
  df[, 1] df[, 2] mean
1       1       6  5.5
2       2       7  5.5
3       3       8  5.5
4       4       9  5.5
5       5      10  5.5
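
To make the scoping issue concrete, here is a rough base R sketch of the split-and-apply idea that ddply is built around. The data frame, its column names (date, sales) and the helper names (pieces, wrong, right) are made up for illustration; this is not plyr's actual implementation.

```r
# Hypothetical data: two dates with different sales figures
df <- data.frame(date  = c("a", "a", "b", "b"),
                 sales = c(10, 20, 30, 50))

# Split the data frame into one piece per date (conceptually what ddply does)
pieces <- split(df, df$date)

# Referring to df inside the function ignores the piece,
# so every group gets the mean of the whole column
wrong <- sapply(pieces, function(piece) mean(df$sales))

# Referring to the piece itself gives the per-group mean
right <- sapply(pieces, function(piece) mean(piece$sales))

print(wrong)
print(right)
```

Here wrong holds the same overall mean for both groups, while right holds a different mean for each date, which mirrors the difference between the two ddply calls above.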

If this doesn't resolve your issue, please provide a sample of your data so that we can reproduce your problem.

Upvotes: 2

crowding

Reputation: 1508

df is the name of your entire data frame; ddply and summarize don't change the meaning of df. summarize is designed to work with named columns. Do your columns have names? If so, use those, which would look something like

ddply(df, .(date), summarize, Sales=mean(sales))

One way to handle columns by position is to replace summarize with a function that operates on each chunk:

ddply(df, .(df[,1]), function(chunk) data.frame(Sales=mean(chunk[,5])))

but I would recommend giving your columns names instead:

colnames(df)[c(1,5)] <- c("date", "sales")
ddply(df, .(date), summarize, Sales=mean(sales))
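
For completeness, here is a small self-contained sketch of the rename-then-summarize approach. The data is invented, since the question did not include any; I am only assuming that column 1 holds the date and column 5 the sales, as in the original call.

```r
library(plyr)

# Hypothetical data frame with five columns:
# column 1 holds the date, column 5 the sales figures
df <- data.frame(
  V1 = c("01.01.2012", "01.01.2012", "01.02.2012", "01.02.2012"),
  V2 = 1:4,
  V3 = 5:8,
  V4 = 9:12,
  V5 = c(100, 200, 300, 500)
)

# Give the two columns of interest proper names
colnames(df)[c(1, 5)] <- c("date", "sales")

# Group by date and compute the mean of sales within each group
result <- ddply(df, .(date), summarize, Sales = mean(sales))
print(result)
```

With this made-up data, each date gets its own mean instead of the mean of the whole column.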

Upvotes: 1
