Andrew Ma
Andrew Ma

Reputation: 421

R programming - Split up a group of time series indexed by ID with irregular observation periods into regular monthly observations

I have a set of data regarding amounts of something users with unique IDs used between in a data.frame in r.

ID        start date         end date        amount
1         1-15-2012          2-15-2012       6000
1         2-15-2012          3-25-2012       4000
1         3-25-2012          5-26-2012       3000
1         5-26-2012          6-13-2012       1000
2         1-16-2012          2-27-2012       7000
2         2-27-2012          3-18-2012       2000
2         3-18-2012          5-23-2012       3000
 ....
10000     1-12-2012          2-24-2012       12000
10000     2-24-2012          3-11-2012       22000
10000     3-11-2012          5-27-2012       33000
10000     5-27-2012          6-10-2012       5000    

The time series for each ID starts and ends at inconsistent times, and contain an inconsistent number of observations. However, they are all formatted in the above manner; the start and end dates are Date objects.

I would like to standardize the breakdowns for each ID to a monthly time series, with data points at the start of each month, weighing the observed amount numbers which happen to straddle two or more months accordingly. In other words, I would like to turn this series into something like

ID        start date         end date        amount
1         1-1-2012          2-1-2012       3096 = 6000 * 16/31
1         2-1-2012          3-1-2012       4339 = 6000*15/31+4000*14/39
1         3-1-2012          4-1-2012       etc
 ....
1         6-1-2012          7-1-2012       etc
2         1-1-2012          2-1-2012       etc
2         2-1-2012          3-1-2012       etc
2         3-1-2012          4-1-2012       etc
2         4-1-2012          5-1-2012       etc
2         5-1-2012          6-1-2012       etc
 ....
10000     1-1-2012          2-1-2012       etc
 ....
10000     6-1-2012          7-1-2012       etc

Where the value for ID 1 between 2/1/12 and 3/1/12 is calculated by weighing the number of days in the 1-15-2012 to 2-15-2012 observation that land in February (15 days / 31 days) with the amount in that observation span (6000) with the number of days in the 2-15 to 3-25 observation span that fall in February (14 days/ 39 days, as 2012 was a leap year) times the amount in that observation span (4000), yielding 6000*15/31+4000*14/39 = 4339. This should be done for each ID time series. We do not consider the case where the observation periods all fit into one month; but if they are spread out over more than two months they should be split up over that number of months with the appropriate weighings.

I'm rather new to r and could certainly use some help on this!

Upvotes: 1

Views: 1696

Answers (3)

oshun
oshun

Reputation: 2349

Here's a solution using plyr and reshape. The numbers aren't the same as what you provided, so I may have misunderstood your intent though this seems to meet your stated goal (weighted average of amount by month).

df$index <- 1:nrow(df) #Create a unique index number

#Format the dates from factors to dates
df$start.date <- as.Date(df$start.date, format="%m/%d/%Y")
df$end.date <- as.Date(df$end.date, format="%m/%d/%Y")

library(plyr); library(reshape)  #Load the libraries

#dlaply = (d)ataframe to (l)ist using (ply)r
#Subset on dataframe by "index" and perform a function on each subset called "X"
#Create a list containing:
#    ID, each day from start to end date, amount recorded over that day
df2 <- dlply(df, .(index), function(X) { 
  ID <- X$ID        #Keep the ID value
  n.days <- as.numeric(difftime( X$end.date, X$start.date ))  #Calculate time difference in days, report the result as a number
  day <- seq(X$start.date, X$end.date, by="days")   #Sequence of days
  amount.per.day <- X$amount/n.days      #Amount for that day
  data.frame(ID, day, amount.per.day)    #Last line is the output
})

#Change list back into data.frame
df3 <- ldply(df2, data.frame)   #ldply = (l)ist to (d)ataframe using (ply)r
df3$mon <-  as.numeric(format(df3$day, "%m"))   #Assign a month to all dates

#Summarize by each ID and month: add up the daily amounts
ddply(df3, .(ID, mon), summarise, amount = sum(amount.per.day))

#       ID mon    amount
#    1   1   1 3290.3226
#    2   1   2 4441.6873
#    3   1   3 2902.8122
#    4   1   4 1451.6129
#    5   1   5 1591.3978
#    6   1   6  722.2222
#    7   2   1 2666.6667
#    8   2   2 4800.0000
#    9   2   3 2436.3636
#    10  2   4 1363.6364
#    11  2   5 1045.4545

Incidentally, for future posts, you can get faster answers if you provide the code to replicate your data. If your code is somewhat complicated, you can use dput(yourdata). HTH!

Upvotes: 0

Robert
Robert

Reputation: 5152

Here is using native R:

#The data
df=read.table(text='ID        start_date         end_date        amount
1         1-15-2012          2-15-2012       6000
1         2-15-2012          3-25-2012       4000
1         3-25-2012          5-26-2012       3000
1         5-26-2012          6-13-2012       1000
2         1-16-2012          2-27-2012       7000
2         2-27-2012          3-18-2012       2000
2         3-18-2012          5-23-2012       3000
10000     1-12-2012          2-24-2012       12000
10000     2-24-2012          3-11-2012       22000
10000     3-11-2012          5-27-2012       33000
10000     5-27-2012          6-10-2012       5000',
              header=T,row.names = NULL,stringsAsFactors =FALSE)

df[,2]=as.Date(df[,2],"%m-%d-%Y")
df[,3]=as.Date(df[,3],"%m-%d-%Y")

df1=data.frame(n=1:length(df$ID),ID=df$ID)
df1$startm=as.Date(levels(cut(df[,2],"month"))[cut(df[,2],"month")],"%Y-%m-%d")
df1$endm=as.Date(levels(cut(df[,3],"month"))[cut(df[,3],"month")],"%Y-%m-%d")
df1=df1[,-1]
#compute days in month and total days
df$dayin=as.numeric((df1$endm-1)-df$start_date)
df$daytot=as.numeric(df$end_date-df$start_date)
#separate amount this month and next month
df$ammt=df$amount*df$dayin/df$daytot
df$ammt.1=df$amount*(df$daytot-df$dayin)/df$daytot

#using by compute new amount
df1$amount=do.call(c,
  by(df[,c("ammt","ammt.1")],df$ID,function(d)d[,1]+c(0,d[-nrow(d),2]))
        )
df1

> df1
      ID     startm       endm    amount
1      1 2012-01-01 2012-02-01  3096.774
2      1 2012-02-01 2012-03-01  4339.123
3      1 2012-03-01 2012-05-01  4306.038
4      1 2012-05-01 2012-06-01  1535.842
5      2 2012-01-01 2012-02-01  2500.000
6      2 2012-02-01 2012-03-01  4700.000
7      2 2012-03-01 2012-05-01  3754.545
8  10000 2012-01-01 2012-02-01  5302.326
9  10000 2012-02-01 2012-03-01 13572.674
10 10000 2012-03-01 2012-05-01 36553.571
11 10000 2012-05-01 2012-06-01 13000.000

Upvotes: 1

chappers
chappers

Reputation: 2415

To solve this I think the easiest way is to break it down into two problems.

  1. How can I get a daily breakdown of the figures I'm interested in? This is my assumption based on the information you provided above.
  2. How do I group by a date range and summarise to what I'm interested in?

For the following example, I will use the data set which I created using the code below:

df <- data.frame(
  id=c(1,1,1,1,2,2,2),
  start_date=as.Date(c("1-15-2012",
                       "2-15-2012",
                       "3-25-2012",
                       "5-26-2012",
                       "1-16-2012",
                       "2-27-2012",
                       "3-18-2012"), "%m-%d-%Y"),
  end_date=as.Date(c("2-15-2012",
                     "3-25-2012",
                     "5-26-2012",
                     "6-13-2012",
                     "2-27-2012",
                     "3-18-2012",
                     "5-23-2012"), "%m-%d-%Y"),
  amount=c(6000,
           4000,
           3000,
           1000,
           7000,
           2000,
           3000)  
  )

1. Provide daily figures

To provide the daily figures, firstly we get the daily contribution:

df$daily_contribution = df$amount/as.numeric(df$end_date - df$start_date)  

Then, we will expand the date range using the start and end dates. There are a couple ways which you can do it, but seeing that you apply the dplyr tag, using the dplyr way we have:

library(dplyr)
df <- df %>%
  rowwise() %>%
  do(data.frame(id=.$id, 
                date=as.Date(seq(from=.$start_date, to=(.$end_date), by="day")), 
                daily_contribution=.$daily_contribution))

which has some output which looks like this:

Source: local data frame [285 x 3]
Groups: <by row>

   id       date daily_contribution
1   1 2012-01-15           193.5484
2   1 2012-01-16           193.5484
3   1 2012-01-17           193.5484
4   1 2012-01-18           193.5484
5   1 2012-01-19           193.5484
6   1 2012-01-20           193.5484
7   1 2012-01-21           193.5484
8   1 2012-01-22           193.5484
9   1 2012-01-23           193.5484
10  1 2012-01-24           193.5484
.. ..        ...                ...

2. Create a grouping variable

Next we create some kind of grouping variable that we're interested in. I've used lubridate for ease to get the month and year of the dates:

library(lubridate)
df$mnth=month(df$date)
df$yr=year(df$date)

Now with all of this we can easily use dplyr to summarise our information by the dates as required.

df %>% 
  group_by(id, mnth, yr) %>%
  summarise(amount=sum(daily_contribution))

with output:

Source: local data frame [11 x 4]
Groups: id, mnth

   id mnth   yr    amount
1   1    1 2012 3290.3226
2   1    2 2012 4441.6873
3   1    3 2012 2902.8122
4   1    4 2012 1451.6129
5   1    5 2012 1591.3978
6   1    6 2012  722.2222
7   2    1 2012 2666.6667
8   2    2 2012 4800.0000
9   2    3 2012 2436.3636
10  2    4 2012 1363.6364
11  2    5 2012 1045.4545

To get it precisely in the format you specified:

df %>% rowwise() %>%
  mutate(start_date=as.Date(ISOdate(yr, mnth, 1)),
         end_date=as.Date(ISOdate(yr, mnth+1, 1))) %>%
  select(id, start_date, end_date, amount)

with output:

Source: local data frame [11 x 4]
Groups: <by row>

   id start_date   end_date    amount
1   1 2012-01-01 2012-02-01 3290.3226
2   1 2012-02-01 2012-03-01 4441.6873
3   1 2012-03-01 2012-04-01 2902.8122
4   1 2012-04-01 2012-05-01 1451.6129
5   1 2012-05-01 2012-06-01 1591.3978
6   1 2012-06-01 2012-07-01  722.2222
7   2 2012-01-01 2012-02-01 2666.6667
8   2 2012-02-01 2012-03-01 4800.0000
9   2 2012-03-01 2012-04-01 2436.3636
10  2 2012-04-01 2012-05-01 1363.6364
11  2 2012-05-01 2012-06-01 1045.4545

as needed.

note: I can see from your example, that you have, 3096 = 6000 * 16/31 and 4339 = 6000*15/31+4000*14/39, but for the first one, as an example, you have 15 of Jan to 31 of Jan which is 17 days if the date range is inclusive. You can trivially alter this information if required.

Upvotes: 0

Related Questions