Reputation: 1773
I have a data frame in the following format, and I would like to subset it so that I keep only the activities that occur before the first funding activity in each project:
project<- c('A', 'A', 'A', 'B', 'B', 'B','B', 'C', 'C')
activity<- c('kickoff','funding', 'delivery', 'kickoff','kickoff','funding','kickoff', 'kickoff','delivery')
df<- data.frame(project,activity)
I am expecting an output as follows:
project activity
A kickoff
B kickoff
B kickoff
C kickoff
C delivery
Any suggestions?
Upvotes: 2
Views: 143
Reputation: 83275
Some other alternatives with the data.table package:
1) with Reduce:
library(data.table)
setDT(df)[df[, .I[!Reduce('+', activity == 'funding', accumulate = TRUE)], project]$V1]
2) with cummax:
library(data.table)
setDT(df)[df[, .I[!cummax(activity == 'funding')], project]$V1]
3) with pmax:
library(data.table)
setDT(df)[!df[, pmax(.I, .I[activity == 'funding']), by = project]$V1]
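Options 1 and 2 rely on the same idea: a running indicator that stays at 0 until the first 'funding' row and is positive from then on, so negating it keeps only the earlier rows. A minimal base R sketch of that indicator for a single project follows; the vector is_funding and the variable names are illustrative, not part of the answer above.

```r
# Activities of one project in order: kickoff, kickoff, funding, kickoff
is_funding <- c(FALSE, FALSE, TRUE, FALSE)

# Running count of funding rows, built with Reduce (same values as cumsum)
running_reduce <- Reduce(`+`, is_funding, accumulate = TRUE)

# Running maximum: switches from 0 to 1 at the first funding row
running_max <- cummax(is_funding)

# Negating either indicator keeps the rows before the first funding row
keep_reduce <- !running_reduce
keep_max <- !running_max
print(keep_reduce)
```

Both indicators produce the same logical mask here; they only differ once a project has more than one 'funding' row, where the running count keeps increasing while the running maximum stays at 1.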
Upvotes: 2
Reputation: 42592
For the sake of completeness, here is also a data.table solution:
library(data.table)
setDT(df)[!df[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1]
   project activity
1:       A  kickoff
2:       B  kickoff
3:       B  kickoff
4:       C  kickoff
5:       C delivery
Within each project group, we look for the indices of the first appearance of "funding" in column activity and all subsequent rows:
df[, .I[.I >= first(.I[activity == 'funding'])], by = project]
   project V1
1:       A  2
2:       A  3
3:       B  6
4:       B  7
In data.table, .I is a special symbol which holds the row location in df. The second subsetting .I[.I >= first(.I[activity == 'funding'])] is required because which(.I >= first(.I[activity == 'funding'])) would return only the row locations within the group but not within df.
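The difference between group-local and table-wide row locations can be seen in a small sketch. It assumes the data.table package is installed; the table dt and the variable names are illustrative and rebuild the sample data from the question.

```r
library(data.table)

dt <- data.table(
  project  = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  activity = c("kickoff", "funding", "delivery", "kickoff", "kickoff",
               "funding", "kickoff", "kickoff", "delivery")
)

# .I refers to row numbers of the whole table, even inside a by = group
global_rows <- dt[, .I[activity == "funding"], by = project]$V1

# which() starts counting from 1 again inside each group
local_rows <- dt[, which(activity == "funding"), by = project]$V1

print(global_rows)  # 2 6
print(local_rows)   # 2 3
```

Project B's funding row is row 6 of the table but only the third row of its group, which is why the table-wide value from .I is the one that can be used to subset df afterwards. Project C has no funding row and contributes nothing to either result.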
Now we have identified the rows which should not be displayed. Therefore, we get the final result by excluding these row numbers:
df[!df[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1]
If date information is available (and I bet there is a date column when dealing with projects and activities), we can follow a suggestion by @Frank and do an anti non-equi join using the date column:
# create sample data with date column
project<- c('A', 'A', 'A', 'B', 'B', 'B','B', 'C', 'C')
activity<- c('kickoff','funding', 'delivery', 'kickoff','kickoff','funding','kickoff', 'kickoff','delivery')
date <- (as.Date ("2017-10-02") + c(1,4,7,2,5,8,11,3,6))
df <- data.frame(project,activity, date, stringsAsFactors = FALSE)
df <- df[order(df$date), ]
  project activity       date
1       A  kickoff 2017-10-03
4       B  kickoff 2017-10-04
8       C  kickoff 2017-10-05
2       A  funding 2017-10-06
5       B  kickoff 2017-10-07
9       C delivery 2017-10-08
3       A delivery 2017-10-09
6       B  funding 2017-10-10
7       B  kickoff 2017-10-13
# anti non-equi join
setDT(df)[!df[activity == 'funding', first(date), by = project], on = .(project, date >= V1)]
   project activity       date
1:       A  kickoff 2017-10-03
2:       B  kickoff 2017-10-04
3:       B  kickoff 2017-10-07
4:       C  kickoff 2017-10-05
5:       C delivery 2017-10-08
Upvotes: 2
Reputation: 13294
dplyr:
library(dplyr)
df %>%
  group_by(project) %>%
  dplyr::filter(cummin(activity != "funding") == 1)
yields:
# project activity
# <fctr> <fctr>
# 1 A kickoff
# 2 B kickoff
# 3 B kickoff
# 4 C kickoff
# 5 C delivery
base R:
do.call(rbind, lapply(split(df, df$project), function(x) {
  # keep rows until the first "funding" row within this project
  x[cummin(x$activity != "funding") == 1, ]
}))
yields:
# project activity
# A kickoff
# B kickoff
# B kickoff
# C kickoff
# C delivery
I hope this helps.
Upvotes: 2
Reputation: 29125
You can try cumsum to track, for each project, whether a row takes place before or after funding:
library(dplyr)
df %>%
group_by(project) %>%
mutate(before.funding = cumsum(activity == "funding") == 0) %>%
ungroup() %>%
filter(before.funding) %>%
select(-before.funding)
# A tibble: 5 x 2
project activity
<fctr> <fctr>
1 A kickoff
2 B kickoff
3 B kickoff
4 C kickoff
5 C delivery
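If dplyr is not available, the same cumsum flag can be sketched in base R with ave(), which applies a function within groups. This is only an illustration under that assumption; the names n_funding, before_funding and result are made up for the example, and the data frame is rebuilt from the question.

```r
df <- data.frame(
  project  = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  activity = c("kickoff", "funding", "delivery", "kickoff", "kickoff",
               "funding", "kickoff", "kickoff", "delivery"),
  stringsAsFactors = FALSE
)

# Running count of "funding" rows, restarted for every project
n_funding <- ave(as.integer(df$activity == "funding"), df$project, FUN = cumsum)

# A row is kept while no funding row has been seen yet in its project
before_funding <- n_funding == 0

result <- df[before_funding, ]
print(result)
```

Printing n_funding next to the data makes it easy to check where the count switches from 0 to 1 in each project.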
Upvotes: 0