Reputation: 5068
I have a dataframe df
of stock liquidity (dput
statement below), and I need to measure how many of the previous consecutive three rows the value has been > 10:
date sec_id liquidity good count.good.rows
2016-07-29 3277 9.142245 FALSE 0
2016-08-31 3277 11.070555 TRUE 0
2016-09-30 3277 11.934113 TRUE 1
2016-10-31 3277 12.192237 TRUE 2
2016-11-30 3277 10.165183 TRUE 3
2016-12-30 3277 8.414033 FALSE 3
2016-01-29 3426 6.494181 FALSE 0
2016-02-29 3426 8.216213 FALSE 0
2016-03-31 3426 10.081115 TRUE 0
2016-04-29 3426 10.119685 TRUE 1
2016-05-31 3426 8.659732 FALSE 2
2016-06-30 3426 6.790178 FALSE 1
2016-07-29 3426 7.234159 FALSE 0
Note a few things about the data:
sec_id
values, and I need to do this work on each sec_id
value, based on the order of the data
column.good
column, but can't figure out how to do the count.good.rows
column without explicitly using lag(...,1) + lag(...,2) + lag(...,3)
. This would be a poor solution because I need the 3
to be a variable (I may end up wanting to look at previous 2 rows, or 4 rows).Any ideas?
Here's my full dput
:
df = structure(list(date = structure(c(16829, 16860, 16891, 16920, 16952, 16982, 17011, 17044, 17074, 17105, 17135, 17165, 16829, 16860, 16891, 16920, 16952, 16982, 17011, 17044, 17074, 17105, 17135, 17165), class = "Date"),
sec_id = c(3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3277L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L, 3426L),
liquidity = c(4.014428, 3.779665, 4.833813, 5.244417, 7.150838, 7.639399, 9.142245, 11.070555, 11.934113, 12.192237, 10.165183, 8.414033, 6.494181, 8.216213, 10.081115, 10.119685, 8.659732, 6.790178, 7.234159, 8.529101, 9.015898, 8.307979, 8.231237, 8.711095),
good = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)),
class = "data.frame", .Names = c("date", "sec_id", "liquidity", "good"),
row.names = c(NA, -24L))
Upvotes: 1
Views: 224
Reputation: 3875
You could define a lagcounter function which substracts to the cumulative sum of good
the cumulative sum n rows back:
lagcounter = function(x,n) {y = cumsum(x); lag(y-lag(y,n,default=0),default=0)}
And then using a dplyr
syntax, use this newly defined function in a mutate
statement, after having grouped by sec_id
:
library(dplyr)
df %>% group_by(sec_id) %>% mutate(count.good.rows = lagcounter(good,3))
date sec_id liquidity good count.good.rows
1 2016-01-29 3277 4.014428 FALSE 0
2 2016-02-29 3277 3.779665 FALSE 0
3 2016-03-31 3277 4.833813 FALSE 0
4 2016-04-29 3277 5.244417 FALSE 0
5 2016-05-31 3277 7.150838 FALSE 0
6 2016-06-30 3277 7.639399 FALSE 0
7 2016-07-29 3277 9.142245 FALSE 0
8 2016-08-31 3277 11.070555 TRUE 0
9 2016-09-30 3277 11.934113 TRUE 1
10 2016-10-31 3277 12.192237 TRUE 2
11 2016-11-30 3277 10.165183 TRUE 3
12 2016-12-30 3277 8.414033 FALSE 3
13 2016-01-29 3426 6.494181 FALSE 0
14 2016-02-29 3426 8.216213 FALSE 0
15 2016-03-31 3426 10.081115 TRUE 0
16 2016-04-29 3426 10.119685 TRUE 1
17 2016-05-31 3426 8.659732 FALSE 2
18 2016-06-30 3426 6.790178 FALSE 2
19 2016-07-29 3426 7.234159 FALSE 1
20 2016-08-31 3426 8.529101 FALSE 0
21 2016-09-30 3426 9.015898 FALSE 0
22 2016-10-31 3426 8.307979 FALSE 0
23 2016-11-30 3426 8.231237 FALSE 0
24 2016-12-30 3426 8.711095 FALSE 0
Upvotes: 1
Reputation: 13581
Try zoo::rollapply
library(zoo)
df %>%
group_by(sec_id) %>%
mutate(count_good_rows = rollapply(good, 3, sum, align="right", partial=TRUE))
# A tibble: 13 x 5
# Groups: sec_id [2]
# date sec_id liquidity good count_good_rows
# <fctr> <int> <dbl> <lgl> <int>
# 1 2016-07-29 3277 9.14 F 0
# 2 2016-08-31 3277 11.1 T 1
# 3 2016-09-30 3277 11.9 T 2
# 4 2016-10-31 3277 12.2 T 3
# 5 2016-11-30 3277 10.2 T 3
# 6 2016-12-30 3277 8.41 F 2
# 7 2016-01-29 3426 6.49 F 0
# 8 2016-02-29 3426 8.22 F 0
# 9 2016-03-31 3426 10.1 T 1
# 10 2016-04-29 3426 10.1 T 2
# 11 2016-05-31 3426 8.66 F 2
# 12 2016-06-30 3426 6.79 F 1
# 13 2016-07-29 3426 7.23 F 0
Edit If you're only interested in counting the previous three rows
df %>%
group_by(sec_id) %>%
mutate(count_good_rows = rollapply(dplyr::lag(good, 1), 3, function(i) sum(i, na.rm=TRUE), align="right", partial=TRUE))
# A tibble: 13 x 5
# Groups: sec_id [2]
# date sec_id liquidity good count_good_rows
# <fctr> <int> <dbl> <lgl> <int>
# 1 2016-07-29 3277 9.14 F 0
# 2 2016-08-31 3277 11.1 T 0
# 3 2016-09-30 3277 11.9 T 1
# 4 2016-10-31 3277 12.2 T 2
# 5 2016-11-30 3277 10.2 T 3
# 6 2016-12-30 3277 8.41 F 3
# 7 2016-01-29 3426 6.49 F 0
# 8 2016-02-29 3426 8.22 F 0
# 9 2016-03-31 3426 10.1 T 0
# 10 2016-04-29 3426 10.1 T 1
# 11 2016-05-31 3426 8.66 F 2
# 12 2016-06-30 3426 6.79 F 2
# 13 2016-07-29 3426 7.23 F 1
Data
df <- read.table(text="date sec_id liquidity good
2016-07-29 3277 9.142245 FALSE
2016-08-31 3277 11.070555 TRUE
2016-09-30 3277 11.934113 TRUE
2016-10-31 3277 12.192237 TRUE
2016-11-30 3277 10.165183 TRUE
2016-12-30 3277 8.414033 FALSE
2016-01-29 3426 6.494181 FALSE
2016-02-29 3426 8.216213 FALSE
2016-03-31 3426 10.081115 TRUE
2016-04-29 3426 10.119685 TRUE
2016-05-31 3426 8.659732 FALSE
2016-06-30 3426 6.790178 FALSE
2016-07-29 3426 7.234159 FALSE ", header=TRUE)
Upvotes: 1