name_masked

Reputation: 9803

Issue with result of ddply on the data frame - R

I have a data frame with the following data:

    Count    Amount    Org    Bank
    ------------------------------------------
    1        100       ABC    Chase
    15       76        DEF    American Express
    ...
    ...

When I run ddply using:

result1 <- ddply(df, 4, count = sum(as.numeric(df[[1]])), amt = sum(as.numeric(df[[2]])))

I get a result in which every row of result1 has the same count and amt values, i.e.:

 description      count        amt
  Chase             900        432087
  American Express  900        432087
.....

which is definitely not the case. Somehow, it seems like the last sum() value being calculated is applied to all the rows. Am I missing something here?

Upvotes: 0

Views: 1298

Answers (1)

Alex Brown

Reputation: 42942

There are a few problems here:

  1. You are getting the same (wrong) result for every group because the arguments to ddply refer back to the original data frame df, e.g. df[[1]]. That expression is always the whole column, so the sum is the total over all rows, whatever the group.
    ddply doesn't work like that: use the column names directly, e.g. Amount and Count.

  2. You are missing the .fun function argument to ddply - in this case summarize is appropriate.
    (I honestly don't know how your code worked at all without this.)

  3. You are using an undocumented way (the column index 4) to select the grouping columns in the .variables argument. Try .(Bank) or c("Bank") instead.

This should work:

ddply(df, .(Bank), summarize, count = sum(as.numeric(Count)),
                              amt = sum(as.numeric(Amount)))
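
To see why point 1 matters, here is a minimal base R sketch of the same split-and-sum idea, without plyr. The toy values, the third row, and the names groups, wrong and right are made up for illustration; split() plus sapply() only roughly stands in for what ddply does per group.

```r
# Toy data; the values are invented for illustration only.
df <- data.frame(
  Count  = c(1, 15, 4),
  Amount = c(100, 76, 50),
  Org    = c("ABC", "DEF", "GHI"),
  Bank   = c("Chase", "American Express", "Chase"),
  stringsAsFactors = FALSE
)

# One data frame per bank (a rough stand-in for ddply's grouping step).
groups <- split(df, df$Bank)

# Wrong: df[[1]] is the full Count column, so every group gets the grand total (20).
wrong <- sapply(groups, function(g) sum(as.numeric(df[[1]])))

# Right: sum the Count column of the group's own piece (Chase: 5, American Express: 15).
right <- sapply(groups, function(g) sum(as.numeric(g$Count)))

print(wrong)
print(right)
```

Using the bare column names inside summarize, as in the ddply call above, corresponds to the second version: the names are looked up in each group's piece rather than in the full df.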

Upvotes: 7
