Jared C

Reputation: 372

ddply summarize by year, include count per year

This should be super simple, but I can't seem to figure it out.

I am using the movies data frame from the ggplot2movies package and trying to summarize it into a data frame that is easier to plot. In case you don't want to load ggplot2movies, a sample of the relevant data is:

# A tibble: 6 x 2
   year rating
  <int>  <dbl>
1  1971    6.4
2  1939    6  
3  1941    8.2
4  1996    8.2
5  1975    3.4
6  2000    4.3

I have the following successful code, based on the plyr library:

years <- ddply(movies,"year",summarize,rating=mean(rating))

It gives this result, which is ideal for a scatter plot or line chart:

> head(years)
  year   rating
1 1893 7.000000
2 1894 4.888889
3 1895 5.500000
4 1896 5.269231
5 1897 4.677778
6 1898 5.040000

However, I can't work out how to add a count column, so that a third aesthetic such as size can show the number of movies produced each year on the plot. It should be something simple like:

years <- ddply(movies,"year",summarize,rating=mean(rating),count=count(years))

However, this gives an error:

Error in summarise_impl(.data, dots) : Evaluation error: no applicable method for 'groups' applied to an object of class "character".

I could add a column of 1s to the original data frame and then sum that column. But given how versatile R is, I think there must be a much simpler and more appropriate way to do it within the ddply call.

Upvotes: 3

Views: 288

Answers (1)

Rui Barradas

Reputation: 76653

You can use dplyr's n(), which returns the number of rows in the current group, to get the count.

library(ggplot2movies)
library(dplyr)

data("movies")

movies %>%
  group_by(year) %>%
  summarise(rating = mean(rating),
            years = n()) -> mvs

head(mvs, 10)
# A tibble: 10 x 3
#    year rating years
#   <int>  <dbl> <int>
# 1  1893   7        1
# 2  1894   4.89     9
# 3  1895   5.5      3
# 4  1896   5.27    13
# 5  1897   4.68     9
# 6  1898   5.04     5
# 7  1899   4.28     9
# 8  1900   4.73    16
# 9  1901   4.68    28
#10  1902   4.9      9
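
To see what n() does without loading ggplot2movies, here is a minimal sketch on a made-up data frame. The toy values and the names toy and toy_summary are assumptions for illustration only, not part of the movies data.

```r
library(dplyr)

# Toy data standing in for movies; the values are invented for illustration.
toy <- data.frame(
  year   = c(1971, 1971, 1939, 1996, 1996, 1996),
  rating = c(6.4, 5.6, 6.0, 8.2, 7.0, 3.4)
)

toy_summary <- toy %>%
  group_by(year) %>%
  summarise(rating = mean(rating),  # mean rating within each year
            count  = n())           # number of rows within each year

toy_summary
```

The count column can then be mapped to point size when plotting, for example with ggplot2's aes(x = year, y = rating, size = count) and geom_point().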

Another solution uses the plyr package, as the OP suggested. Here length(year) counts the rows in each year's group.

library(plyr)

mvs2 <- ddply(movies, "year", summarize, 
              rating = mean(rating), years = length(year))
all.equal(mvs, mvs2)
#[1] TRUE
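
One caveat when both packages are attached: plyr and dplyr each export summarize, and whichever was attached last masks the other. The summarise_impl in the question's error is a dplyr internal, which suggests that ddply was handed dplyr's summarize there. That is an assumption, since the question does not show its library() calls. A sketch that sidesteps the masking by qualifying each call with its namespace, on invented toy data (toy and toy_counts are illustrative names):

```r
# Toy data, invented for illustration.
toy <- data.frame(
  year   = c(1971, 1971, 1939, 1996, 1996, 1996),
  rating = c(6.4, 5.6, 6.0, 8.2, 7.0, 3.4)
)

# The plyr:: prefixes make the call independent of the order in which
# the packages were attached.
toy_counts <- plyr::ddply(toy, "year", plyr::summarize,
                          rating = mean(rating),
                          # Count a column that is not reassigned in this call:
                          # by this point rating already holds the group mean.
                          count  = length(year))

toy_counts
```

With the prefixes in place, library(plyr) and library(dplyr) can be loaded in either order without changing which summarize is used.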

Upvotes: 2
