bikeclub
bikeclub

Reputation: 379

How do I plot a running cumulative total from individual records in R?

I have a dataset where different cities go in and out of a program, like this example dataset:

example.dat <- data.frame (c(1000, 2000, 3000), c("15-10-01", "16-05-01", "16-07-01"), c("16-06-01", "16-10-01", "17-08-01"))
colnames(example.dat) <- c("Population", "Enter.Program", "Leave.Program")

This gives you a data frame that looks like:

Population  Enter.Program Leave.Program
1000        15-10-01      16-06-01
2000        16-05-01      16-10-01
3000        16-07-01      17-08-01

First, I'd like to create an output table like this:

Per.Begin   Per.End     Total.Pop.In
15-10-01    16-04-30    1000
16-05-01    16-05-30    3000
16-06-01    16-06-30    2000
16-07-01    16-09-30    5000
16-10-01    17-07-30    3000
17-08-01    18-04-26    0 

And then plot this in ggplot as a graph that looks like either a step function or kind of like a jagged rectangular surface where the top edge is the running total, kind of like a cumulative density function but where the y-axis can go down as well as up, and where the x-axis goes in steps that are the width of time periods.

Here are the steps I've blocked out, but I don't know how to execute:

  1. Make column of unique dates
  2. Make column of period change end dates (i.e., next unique date minus one day)
  3. Calculate running sum of the cities within each period (i.e., third column)
  4. Plot in ggplot

Upvotes: 1

Views: 1371

Answers (1)

Mark
Mark

Reputation: 4537

Using dplyr (because you tagged the question with it) you can do what you want. The main things that need to happen are:

  1. Break out your entries and exits making your population positive and negative.
  2. Get all the dates from your earliest to your last so you can have the desired blocky lines. It is probably possible to do this without every date, but this is easy and requires less thinking.

Code is below

library(dplyr)
library(ggplot2)

example.dat <- data.frame (c(1000, 2000, 3000), c("15-10-01", "16-05-01", "16-07-01"), c("16-06-01", "16-10-01", "17-08-01"))
colnames(example.dat) <- c("Population", "Enter.Program", "Leave.Program")

changes = example.dat %>% 
  select("Population","Date"="Enter.Program") %>% 
  bind_rows(example.dat %>% 
              select("Population","Date"="Leave.Program") %>% 
              mutate(Population = -1*Population)) %>% 
  mutate(Date = as.Date(Date,"%y-%m-%d"))

startDate = min(changes$Date)
endDate = max(changes$Date)
final = data_frame(Date = seq(startDate,endDate,1)) %>% 
  left_join(changes,by="Date") %>% 
  mutate(Population = cumsum(ifelse(is.na(Population),0,Population)))

ggplot(data = final,aes(x=Date,y=Population)) +
  geom_line()

UPDATE

If you don't want to have every date from the earliest to the latest, you can use a blurgh for loop to add the needed rows to get a pretty result. Here we walk through and duplicate each date after the first with the preceding cumulative sum. It's not pretty, but it makes the graph.

library(dplyr)
library(ggplot2)

example.dat <- data.frame (c(1000, 2000, 3000), c("15-10-01", "16-05-01", "16-07-01"), c("16-06-01", "16-10-01", "17-08-01"))
colnames(example.dat) <- c("Population", "Enter.Program", "Leave.Program")

changes = example.dat %>% 
  select("Population","Date"="Enter.Program") %>% 
  bind_rows(example.dat %>% 
              select("Population","Date"="Leave.Program") %>% 
              mutate(Population = -1*Population)) %>% 
  mutate(Date = as.Date(Date,"%y-%m-%d")) %>% 
  arrange(Date) %>% 
  mutate(Population = cumsum(Population))

for(i in nrow(changes):2){
  changes = bind_rows(changes[1:(i-1),],
            data_frame(Population = changes$Population[i-1],Date = changes$Date[i]),
            changes[i:nrow(changes),])
}

ggplot(data = changes,aes(x=Date,y=Population)) +
  geom_line()

Upvotes: 1

Related Questions