ben_says

Reputation: 2513

R new column summarizing count of groups of columns

library(data.table)
df <- structure(
  list(
    type = c("AAA", "AAA", "AAA", "BCD", "BCD", "BCD", "EEE", "EEE", "EEE", "EEE"), 
    date = c("2015-01-01", "2015-01-01", "2015-01-01", "2015-01-02", "2015-01-05", "2015-01-05", "2015-01-04", "2015-01-04", "2015-01-04", "2015-01-04")
    ), 
  .Names = c("type", "date"), 
  class = "data.frame", 
  row.names = c(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L))
df$date <- as.Date(df$date)
df

This sets up the following example data frame, named 'df':

  type       date
0  AAA 2015-01-01
1  AAA 2015-01-01
2  AAA 2015-01-01
3  BCD 2015-01-02
4  BCD 2015-01-05
5  BCD 2015-01-05
6  EEE 2015-01-04
7  EEE 2015-01-04
8  EEE 2015-01-04
9  EEE 2015-01-04

How would base R, data.table, or even dplyr users create a new column that gives the number of times each 'type' is recorded on a given 'date'? The desired result is:

  type       date typeDateGroup
0  AAA 2015-01-01             3 
1  AAA 2015-01-01             3
2  AAA 2015-01-01             3
3  BCD 2015-01-02             1
4  BCD 2015-01-05             2
5  BCD 2015-01-05             2
6  EEE 2015-01-04             4
7  EEE 2015-01-04             4
8  EEE 2015-01-04             4
9  EEE 2015-01-04             4

In case it matters: unlike this small example, my real data usually has 3-5 million rows.

Don't run this; it was my attempt, and it fails:

library(data.table)
df <- as.data.table(df)
df<-df[order(type, date), `:=`(typeDateGroup = .N), by=type, date]

Thank you for looking at this and dominating with your skills.

Upvotes: 1

Views: 2705

Answers (2)

David Arenburg

Reputation: 92300

For future reference, in your data.table version, if you want to overwrite df, convert it by reference, i.e., use setDT(df) instead of df <- as.data.table(df).

Also, when using assignment by reference (:=) within the data.table object, there is no need for df <-.

Moreover, you can also sort your data.table by reference using data.table's setorder function (though you don't have to, neither in this specific case nor in general).

Lastly, when passing two variables into the by argument, you should use one of list(type, date), .(type, date), c("type", "date"), or "type,date".
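Putting those points together, a minimal sketch of the corrected data.table call could look like the following. It rebuilds the example data from the question so that it runs on its own; the only choice made here is the .(type, date) form of by, and any of the other forms listed above should behave the same.

```r
library(data.table)

# Example data from the question
df <- data.frame(
  type = c("AAA", "AAA", "AAA", "BCD", "BCD", "BCD", "EEE", "EEE", "EEE", "EEE"),
  date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-01", "2015-01-02",
                   "2015-01-05", "2015-01-05", "2015-01-04", "2015-01-04",
                   "2015-01-04", "2015-01-04")),
  stringsAsFactors = FALSE
)

# Convert to a data.table by reference (no copy, no df <- needed)
setDT(df)

# Add the group size as a new column by reference;
# .N is the number of rows in each (type, date) group
df[, typeDateGroup := .N, by = .(type, date)]

# Print the result (the trailing [] forces printing after :=)
df[]
```

Because := modifies df in place, no sorting and no reassignment are required, which is relevant for tables with millions of rows.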

For completeness, here's the dplyr version:

library(dplyr)
df %>% 
  group_by(type, date) %>% 
  mutate(typeDateGroup = n())
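As a side note, mutate() after group_by() leaves the result grouped, which can surprise later steps. The sketch below shows two ways to get an ungrouped result: adding ungroup(), or using dplyr's add_count(). The names res1 and res2 are only illustrative, and the data is rebuilt from the question so the block runs on its own.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  type = c("AAA", "AAA", "AAA", "BCD", "BCD", "BCD", "EEE", "EEE", "EEE", "EEE"),
  date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-01", "2015-01-02",
                   "2015-01-05", "2015-01-05", "2015-01-04", "2015-01-04",
                   "2015-01-04", "2015-01-04")),
  stringsAsFactors = FALSE
)

# Option 1: same as above, then drop the grouping explicitly
res1 <- df %>%
  group_by(type, date) %>%
  mutate(typeDateGroup = n()) %>%
  ungroup()

# Option 2: add_count() appends the per-group row count in one step
res2 <- df %>%
  add_count(type, date, name = "typeDateGroup")
```

Both options keep all original rows and add the same typeDateGroup column.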

Upvotes: 5

Josh O'Brien

Reputation: 162451

A couple of options:

## Using base R only:
df <- transform(df, typeDateGroup=ave(as.numeric(date), type, date, FUN=length))

## With data.table:
library(data.table)
dt <- data.table(df)
dt[, typeDateGroup:=.N, by=c("type","date")]

Upvotes: 4
