Reputation: 237
Suppose you have the following table:
Student<-c("Bob", "Joe", "Sam", "John")
ClassDate<-as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-05"), "%Y-%m-%d")
df<-data.frame(Student=Student, ClassDate=ClassDate)
df
Student ClassDate
1 Bob 2020-01-01
2 Joe 2020-01-01
3 Sam 2020-01-02
4 John 2020-01-05
When you make a cumulative frequency table for ClassDate, you get the following:
data.frame(cumsum(table(df$ClassDate)))
cumsum.table.df.ClassDate..
2020-01-01 2
2020-01-02 3
2020-01-05 4
However, what I'm looking for is the following which still includes the missing dates
cumsum.table.df.ClassDate..
2020-01-01 2
2020-01-02 3
2020-01-03 3
2020-01-04 3
2020-01-05 4
Upvotes: 5
Views: 89
Reputation: 16099
This is a good example where using Rcpp is not only (much) faster but also clearer.
Rcpp::sourceCpp(
code = '
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export(rng = false)]]
IntegerVector tabulate_complete(IntegerVector x, IntegerVector tbl) {
R_xlen_t N = x.length();
R_xlen_t TN = tbl.length();
IntegerVector out(TN);
int t0 = tbl[0];
for (R_xlen_t i = 0; i < N; ++i) {
int xi = x[i];
out[xi - t0] += 1;
}
return out;
}
')
Tabulate_Dates <- function(.dates, all_dates = NULL) {
if (is.null(all_dates)) {
all_dates <- seq(as.Date("1900-01-01"), Sys.Date(), by = 1)
}
out <- tabulate_complete(as.integer(.dates), as.integer(all_dates))
data.table::data.table(all_dates, out) # purely for ease of printing
}
example_dates <- dqrng::dqsample(seq(as.Date("1900-01-01"), Sys.Date(), by = 1),
size = 36500, # not all dates
replace = TRUE)
example_many_dates <- rep_len(example_dates, 1e7)
bench::system_time(Tabulate_Dates(example_many_dates))
# > 0.050 s
# > 2.6 s for tibble method
# > 170 s for aggregate base R method
Upvotes: 2
Reputation: 19163
An efficient base R approach:
The motivation here is to get a match-list of existing dates against all dates simply by using colSum
of the resulting TRUE
/FALSE
matrix.
df <- structure(list(Student = c("Bob", "Joe", "Sam", "John"),
ClassDate = structure(c(18262, 18262, 18263, 18266),
class = "Date")), class = "data.frame", row.names = c(NA, -4L))
# Student ClassDate
#1 Bob 2020-01-01
#2 Joe 2020-01-01
#3 Sam 2020-01-02
#4 John 2020-01-05
dates <- seq( df$ClassDate[1], df$ClassDate[nrow(df)], by=1 )
data.frame( dates, cumsum=cumsum( colSums(sapply( dates, function(x) df$ClassDate == x )) ) )
dates cumsum
1 2020-01-01 2
2 2020-01-02 3
3 2020-01-03 3
4 2020-01-04 3
5 2020-01-05 4
Upvotes: 1
Reputation: 887571
An option is to create a column of 1s, expand the data with complete
by creating a seq
uence from min
imum to max
imum value of 'ClassDate' by
'day' while fill
ing the 'n' with 0, then do a group by sum
on the 'n' column, and do the cumsum
library(dplyr)
library(tidyr)
df %>%
mutate(n = 1) %>%
complete(ClassDate = seq(min(ClassDate), max(ClassDate),
by = '1 day'), fill = list(n = 0)) %>%
group_by(ClassDate) %>%
summarise(n = sum(n), .groups = 'drop') %>%
mutate(n = cumsum(n))
-output
# A tibble: 5 x 2
# ClassDate n
#* <date> <dbl>
#1 2020-01-01 2
#2 2020-01-02 3
#3 2020-01-03 3
#4 2020-01-04 3
#5 2020-01-05 4
In base R
, an option is also to specify the levels
while converting to factor
v1 <- with(df, factor(ClassDate, levels =
as.character(seq(min(ClassDate), max(ClassDate), by = '1 day'))))
data.frame(Cumsum = cumsum(table(v1)))
Upvotes: 2
Reputation: 102309
A base R option
aggregate(
cumfreq ~ ClassDate,
transform(
merge(
cbind(df, cumfreq = 1),
data.frame(
ClassDate = seq(min(df$ClassDate), max(df$ClassDate), by = "day")
),
all = TRUE
),
cumfreq = cumsum(replace(cumfreq, is.na(cumfreq), 0))
),
max
)
gives
ClassDate cumfreq
1 2020-01-01 2
2 2020-01-02 3
3 2020-01-03 3
4 2020-01-04 3
5 2020-01-05 4
Upvotes: 1