rschwieb

Reputation: 786

Doing counts for a column of a dataframe in R

I have a dataframe "samp" with a column (let's call it "rating") which takes one of several values (say "good", "medium", or "bad").

I would like to group by several other columns, count the frequency of "good", "medium" and "bad" within each group, and report those frequencies in new columns. (So maybe col1 is movie year, col2 is genre, and then there should be three more columns telling you how many of each type of rating there were for each year and genre.)

 ddply(samp,c("col1","col2"), summarize, 
       good=table(samp$rating)["good"],
       medium=table(samp$rating)["medium"],
       bad=table(samp$rating)["bad"])

The problem is (I think) that the functions I'm defining are not in terms of the groups ddply is outputting, they are just constant functions of samp. How can I define the functions here so that they're functions of the groups?

I tried using an anonymous function:

 ddply(samp,c("col1","col2"), summarize, 
       good=function(df)table(df$rating)["good"],
       medium=function(df)table(df$rating)["medium"],
       bad=function(df)table(df$rating)["bad"])

I can never get it working, though. The error I've hit most often is:

 Error in output[[var]][rng] <- df[[var]] : 
 incompatible types (from closure to logical) in subassignment type fix

So lay it on me. What's the ridiculously simple solution that did not turn up while I blundered around trying 948506 combinations of ddply and table? Thank you.

Upvotes: 0

Views: 393

Answers (2)

redmode

Reputation: 4941

Generic data:

samp <- data.frame(rating=c("bad","medium","good","bad","medium","good"),
                   col1=c(2007,2010,2007,2009,2010,2010),
                   col2=c("fiction","fiction","fiction","drama","drama","drama"))

Code (don't prefix the column names with samp$; inside summarize, a bare column name refers to the rows of the current group):

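library(plyr)  # provides ddply() and summarize()

# Count the rows matching each rating within every (col1, col2) group
ddply(samp, c("col1", "col2"), summarize,
      good = sum(rating == "good"),
      medium = sum(rating == "medium"),
      bad = sum(rating == "bad"))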

Output:

  col1    col2 good medium bad
1 2007 fiction    1      0   1
2 2009   drama    0      0   1
3 2010   drama    1      1   0
4 2010 fiction    0      1   0
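
If plyr is not available, roughly the same table can be sketched in base R with aggregate(). This is an alternative to the ddply call, not part of the approach above; the helper data frame `flags` is a name made up here for illustration, and the row order may differ from the ddply output.

```r
samp <- data.frame(rating = c("bad", "medium", "good", "bad", "medium", "good"),
                   col1 = c(2007, 2010, 2007, 2009, 2010, 2010),
                   col2 = c("fiction", "fiction", "fiction", "drama", "drama", "drama"))

# One logical indicator column per rating value (hypothetical helper).
flags <- data.frame(good   = samp$rating == "good",
                    medium = samp$rating == "medium",
                    bad    = samp$rating == "bad")

# Summing TRUE/FALSE within each (col1, col2) group counts the matches.
counts <- aggregate(flags, by = samp[c("col1", "col2")], FUN = sum)
print(counts)
```

The idea is the same as sum(rating == "good") inside summarize: a logical vector sums to the number of TRUE entries, so a rating that is absent from a group yields 0.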

Upvotes: 1

Sven Hohenstein

Reputation: 81693

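Just remove all instances of samp$ inside ddply and it will work. Within summarize, bare column names are evaluated against the current group, whereas samp$rating always refers to the whole data frame: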

ddply(samp,c("col1","col2"), summarize, 
  good=table(rating)["good"],
  medium=table(rating)["medium"],
  bad=table(rating)["bad"])
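
One caveat worth checking: this appears to rely on rating being a factor, which was the default for string columns in data.frame() before R 4.0. With a plain character column, table() only sees the values present in the group, so a rating missing from that group would come back as NA rather than 0. A minimal base-R sketch of the difference, where `grp` and `lv` are made-up stand-ins for one group's ratings and the full set of levels:

```r
grp <- c("bad", "medium")            # hypothetical group with no "good" rows
lv  <- c("good", "medium", "bad")    # assumed full set of rating values

# Character vector: "good" is not among the tabulated values.
na_count <- table(grp)["good"]

# Factor with explicit levels: absent levels are tabulated as 0.
zero_count <- table(factor(grp, levels = lv))["good"]

# Comparing and summing does not depend on factor levels.
sum_count <- sum(grp == "good")

print(list(na_count, zero_count, sum_count))
```

If the column is character, converting it once with factor(samp$rating, levels = lv) before calling ddply, or using the sum(rating == ...) form, should avoid the NA entries.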

Upvotes: 2
