Anthony W
Anthony W

Reputation: 1327

Using the r dplyr library to generate aggregate numbers in a new column

I am trying to use dplyr to generate a new column in a data frame, based on the aggregation of values in existing columns. Given my dataframe:

group1 <- c("2019","2019","2019","2018","2018","2017","2017","2017")
group2 <- c("2019-01-01", "2019-01-01","2019-01-01","2018-05-01","2018-06-01","2017-01-01","2017-01-01","2017-02-01")
group3 <- c("A","A","B","A","A","C","C","B")
df <- data.frame("Year" = group1,"Date" = group2,"Sample" = group3)

Gives:

  Year      Date  Sample
1 2019 2019-01-01   A
2 2019 2019-01-01   A
3 2019 2019-01-01   B
4 2018 2018-05-01   A
5 2018 2018-06-01   A
6 2017 2017-01-01   C
7 2017 2017-01-01   C
8 2017 2017-02-01   B

So I'd like to generate new column "Count", that for each row gives the total number of unique dates per sample. So for the above data, I would expect the results to be:

  Year       Date Sample Count
1 2019 2019-01-01   A     1
2 2019 2019-01-01   A     1
3 2019 2019-02-01   B     1
4 2018 2018-05-01   A     2
5 2018 2018-06-01   C     2
6 2017 2017-01-01   C     1
7 2017 2017-01-01   C     1
8 2017 2017-02-01   B     1

I've tried using the following code in r:

df %>%
  group_by(Year) %>%
    group_by(Sample) %>%
      group_by(Date) %>%
        mutate(Count = n_distinct(Date))

But I'm not getting the correct answer!

Upvotes: 1

Views: 22

Answers (1)

arg0naut91
arg0naut91

Reputation: 14764

You could try:

library(dplyr)

df %>% 
  group_by(Year, Sample) %>% 
  mutate(Count = n_distinct(Date))

If you want to pass several variables to group_by, you need to put them together - what you were doing is cancelling out the previous groupings by each new statement.

Moreover, if you'd like to count unique dates, you shouldn't group by them.

The above code would give:

# A tibble: 8 x 4
# Groups:   Year, Sample [6]
  Year  Date       Sample Count
  <fct> <fct>      <fct>  <int>
1 2019  2019-01-01 A          1
2 2019  2019-01-01 A          1
3 2019  2019-01-01 B          1
4 2018  2018-05-01 A          2
5 2018  2018-06-01 A          2
6 2018  2017-01-01 C          1
7 2017  2017-01-01 C          1
8 2017  2017-02-01 B          1

Note that there is a mismatch between your generated data frame and the one you show us. The data frame generated by your code is:

  Year       Date Sample
1 2019 2019-01-01      A
2 2019 2019-01-01      A
3 2019 2019-01-01      B
4 2018 2018-05-01      A
5 2018 2018-06-01      A
6 2018 2017-01-01      C
7 2017 2017-01-01      C
8 2017 2017-02-01      B

Where indeed the only Sample with 2 distinct Dates in a given Year is A (in 2018).

Upvotes: 2

Related Questions