Reputation: 79
I have a dataframe df1 with information on the number of acquisitions that a company has made during a certain year. I need to
a) construct a dummy variable indicating, for each company-year, whether observations are available for three consecutive years
b) if three consecutive years are available for that company-year, sum the number of acquisitions made during that three-year period
df1 <- data.frame(ID=c('XXXX-1999','XXXX-2000', 'XXXX-2001', 'YYYY-1999',
'YYYY-2000', 'ZZZZ-1999','ZZZZ-2000','ZZZZ-2001', 'ZZZZ-2002'),
No.of.Transactions=c(1,0,2,2,2,4,1,0,3))
where ID identifies one company-year observation. The desired output is below:
# Desired output
# ID        | No.of.Transactions | 3 preceding yrs available dummy? | No.of.Transactions during 3 preceding yrs
# XXXX-1999 | 1                  | 0                                | N/A
# XXXX-2000 | 0                  | 0                                | N/A
# XXXX-2001 | 2                  | 1                                | 3
# YYYY-1999 | 2                  | 0                                | N/A
# YYYY-2000 | 2                  | 0                                | N/A
# ZZZZ-1999 | 4                  | 0                                | N/A
# ZZZZ-2000 | 1                  | 0                                | N/A
# ZZZZ-2001 | 0                  | 1                                | 5
# ZZZZ-2002 | 3                  | 1                                | 4
So if the "3 preceding yrs available dummy?" column takes a value of 1, then the final column should sum up all the transactions for the company during the focal and two preceding years.
Thank you in advance!
Upvotes: 1
Views: 259
Reputation: 34376
You could use a combination of ave and zoo::rollsumr. If you still need the dummy variable, you could easily create it from the transaction sum variable.
library(zoo)
df1$trans.sum <- with(df1, ave(No.of.Transactions, sub("(^.{4}).*", "\\1", ID),
FUN = function(x) rollsumr(x, 3, fill = NA)))
df1
         ID No.of.Transactions trans.sum
1 XXXX-1999                  1        NA
2 XXXX-2000                  0        NA
3 XXXX-2001                  2         3
4 YYYY-1999                  2        NA
5 YYYY-2000                  2        NA
6 ZZZZ-1999                  4        NA
7 ZZZZ-2000                  1        NA
8 ZZZZ-2001                  0         5
9 ZZZZ-2002                  3         4
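As a minimal sketch of deriving the dummy from the transaction sum: the snippet below rebuilds the result printed above by hand (so it needs no packages) and flags the rows where a rolling sum exists. The column name three.yrs.available is a hypothetical choice, not part of the original answer. It also assumes that the rows within each company are consecutive years and that No.of.Transactions has no missing values, since the rolling sum counts rows rather than calendar years.

```r
# Rebuild the answer's result by hand; trans.sum values are copied
# from the output printed above.
df1 <- data.frame(
  ID = c("XXXX-1999", "XXXX-2000", "XXXX-2001", "YYYY-1999", "YYYY-2000",
         "ZZZZ-1999", "ZZZZ-2000", "ZZZZ-2001", "ZZZZ-2002"),
  No.of.Transactions = c(1, 0, 2, 2, 2, 4, 1, 0, 3),
  trans.sum = c(NA, NA, 3, NA, NA, NA, NA, 5, 4)
)

# three.yrs.available is a hypothetical column name:
# 1 where a three-row rolling sum exists, 0 where it is NA.
df1$three.yrs.available <- as.integer(!is.na(df1$trans.sum))

df1
```

For this data the flag is 1 only for XXXX-2001, ZZZZ-2001 and ZZZZ-2002, which matches the dummy column in the question's desired output.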
Upvotes: 1
Reputation: 7611
How's this? I'm not overly happy with the three_year_trans = trans + lag(trans, 1) + lag(trans, 2) bit, but it's the best I've got off the top of my head.
In case it's not obvious, the lag(year, 2, default = 0) == year - 2 bit ensures there are no missing years (for example, if company XXXX had XXXX-1999, XXXX-2001, XXXX-2002, there'd be no totals for 2002, as 2000 is missing).
library(dplyr)
library(tidyr)
df1 <- data.frame(ID=c('XXXX-1999','XXXX-2000', 'XXXX-2001', 'YYYY-1999',
'YYYY-2000', 'ZZZZ-1999','ZZZZ-2000','ZZZZ-2001', 'ZZZZ-2002'),
trans=c(1,0,2,2,2,4,1,0,3))
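df1 %>%
  # Split IDs such as "XXXX-1999" into company and year
  separate(ID, c("company", "year"), "-") %>%
  mutate(year = as.integer(year)) %>%
  group_by(company) %>%
  arrange(year) %>%
  mutate(
    # 1 if the row two positions back is exactly two years earlier, else 0
    three_years_available = (lag(year, 2, default = 0) == year - 2) + 0,
    # Sum the focal year and the two preceding years when all three exist
    three_year_trans = if_else(three_years_available == 1,
                               trans + lag(trans, 1) + lag(trans, 2),
                               NA_real_)
  ) %>%
  ungroup() %>%
  arrange(company, year)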
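The gap case mentioned above (XXXX-1999, XXXX-2001, XXXX-2002) can be checked with a small sketch. The data frame gap and the result name gap_checked are hypothetical, made up for illustration, and the check assumes each company has at most one row per year.

```r
library(dplyr)

# Hypothetical data: company XXXX has no observation for 2000
gap <- data.frame(
  company = "XXXX",
  year = c(1999L, 2001L, 2002L),
  trans = c(1, 2, 3)
)

gap_checked <- gap %>%
  group_by(company) %>%
  arrange(year, .by_group = TRUE) %>%
  # The row two positions back must be exactly two years earlier
  mutate(three_years_available =
           as.integer(lag(year, 2, default = 0L) == year - 2L)) %>%
  ungroup()

gap_checked
```

For 2002, the year two rows back is 1999 rather than 2000, so the comparison fails and the flag stays 0 for every row.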
Upvotes: 1