Reputation: 237

When making a table for dates in R, how can I fill in the missing dates to appear as having a frequency of 0?

Suppose you have the following table:

Student<-c("Bob", "Joe", "Sam", "John")
ClassDate<-as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-05"), "%Y-%m-%d")
df<-data.frame(Student=Student, ClassDate=ClassDate)
df

  Student  ClassDate
1     Bob 2020-01-01
2     Joe 2020-01-01
3     Sam 2020-01-02
4    John 2020-01-05

When you make a cumulative frequency table for ClassDate, you get the following:

data.frame(cumsum(table(df$ClassDate)))

           cumsum.table.df.ClassDate..
2020-01-01                           2
2020-01-02                           3
2020-01-05                           4

However, what I'm looking for is the following which still includes the missing dates

           cumsum.table.df.ClassDate..
2020-01-01                           2
2020-01-02                           3
2020-01-03                           3
2020-01-04                           3
2020-01-05                           4

Upvotes: 5

Answers (4)

Hugh

Reputation: 16099

This is a good example where using Rcpp is not only (much) faster but also clearer.

Rcpp::sourceCpp(
  code = '
          #include <Rcpp.h>
          using namespace Rcpp;
          
          // [[Rcpp::export(rng = false)]]
          IntegerVector tabulate_complete(IntegerVector x, IntegerVector tbl) {
            R_xlen_t N = x.length();
            R_xlen_t TN = tbl.length();
            IntegerVector out(TN);
            int t0 = tbl[0];
            for (R_xlen_t i = 0; i < N; ++i) {
              int xi = x[i];
              out[xi - t0] += 1;
            }
            return out;
          }
          
          ')


Tabulate_Dates <- function(.dates, all_dates = NULL) {
  if (is.null(all_dates)) {
    all_dates <- seq(as.Date("1900-01-01"), Sys.Date(), by = 1)
  }
  out <- tabulate_complete(as.integer(.dates), as.integer(all_dates))
  data.table::data.table(all_dates, out)  # purely for ease of printing 
}

example_dates <- dqrng::dqsample(seq(as.Date("1900-01-01"), Sys.Date(), by = 1), 
                                 size = 36500,  # not all dates
                                 replace = TRUE)
example_many_dates <- rep_len(example_dates, 1e7)

bench::system_time(Tabulate_Dates(example_many_dates))
# >   0.050 s

# >   2.6   s for tibble method
# > 170     s for aggregate base R method

Upvotes: 2

Andre Wildberg

Reputation: 19163

An efficient base R approach:

The motivation here is to get a match-list of existing dates against all dates simply by using colSum of the resulting TRUE/FALSE matrix.

df <- structure(list(Student = c("Bob", "Joe", "Sam", "John"),
ClassDate = structure(c(18262, 18262, 18263, 18266),
class = "Date")), class = "data.frame", row.names = c(NA, -4L))
#  Student  ClassDate
#1     Bob 2020-01-01
#2     Joe 2020-01-01
#3     Sam 2020-01-02
#4    John 2020-01-05

dates <- seq( df$ClassDate[1], df$ClassDate[nrow(df)], by=1 )
data.frame( dates, cumsum=cumsum( colSums(sapply( dates, function(x) df$ClassDate == x )) ) )

       dates cumsum
1 2020-01-01      2
2 2020-01-02      3
3 2020-01-03      3
4 2020-01-04      3
5 2020-01-05      4

Upvotes: 1

akrun

Reputation: 887571

An option is to create a column of 1s, expand the data with complete by creating a sequence from minimum to maximum value of 'ClassDate' by 'day' while filling the 'n' with 0, then do a group by sum on the 'n' column, and do the cumsum

library(dplyr)
library(tidyr)
df %>% 
   mutate(n = 1) %>% 
   complete(ClassDate = seq(min(ClassDate), max(ClassDate),
            by = '1 day'), fill = list(n = 0)) %>% 
   group_by(ClassDate) %>% 
   summarise(n = sum(n), .groups = 'drop') %>% 
   mutate(n = cumsum(n))

-output

# A tibble: 5 x 2
#  ClassDate      n
#* <date>     <dbl>
#1 2020-01-01     2
#2 2020-01-02     3
#3 2020-01-03     3
#4 2020-01-04     3
#5 2020-01-05     4

In base R, an option is also to specify the levels while converting to factor

v1 <- with(df, factor(ClassDate, levels = 
  as.character(seq(min(ClassDate), max(ClassDate), by = '1 day'))))
data.frame(Cumsum = cumsum(table(v1)))

Upvotes: 2

ThomasIsCoding

Reputation: 102309

A base R option

aggregate(
  cumfreq ~ ClassDate,
  transform(
    merge(
      cbind(df, cumfreq = 1),
      data.frame(
        ClassDate = seq(min(df$ClassDate), max(df$ClassDate), by = "day")
      ),
      all = TRUE
    ),
    cumfreq = cumsum(replace(cumfreq, is.na(cumfreq), 0))
  ),
  max
)

gives

   ClassDate cumfreq
1 2020-01-01       2
2 2020-01-02       3
3 2020-01-03       3
4 2020-01-04       3
5 2020-01-05       4

Upvotes: 1

When making a table for dates in R, how can I fill in the missing dates to appear as having a frequency of 0?

Answers (4)

Related Questions