user3570187

Reputation: 1773

Subset dataframe after first encounter of a specific string

I have a data frame in the following format and would like to subset it so that, for each project, only the activities before the first funding activity remain:

 project<- c('A', 'A', 'A', 'B', 'B', 'B','B', 'C', 'C')
 activity<- c('kickoff','funding', 'delivery', 'kickoff','kickoff','funding','kickoff', 'kickoff','delivery')

 df<- data.frame(project,activity)

I am expecting an output as follows:

 project   activity 
 A         kickoff
 B         kickoff
 B         kickoff
 C         kickoff
 C         delivery

Any suggestions?

Upvotes: 2

Views: 143

Answers (4)

Jaap

Reputation: 83275

Some other alternatives with the data.table package:

1) with Reduce:

library(data.table)
setDT(df)[df[, .I[!Reduce('+', activity == 'funding', accumulate = TRUE)], project]$V1]

2) with cummax:

library(data.table)
setDT(df)[df[, .I[!cummax(activity == 'funding')], project]$V1]

3) with pmax:

library(data.table)
setDT(df)[!df[, pmax(.I, .I[activity == 'funding']), by = project]$V1]
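The first two variants rest on the same idea: a running count (or running maximum) of activity == 'funding' stays at zero until the first funding row of a group. A minimal base R sketch on the activities of project B alone (the names act, running_count, running_max and keep are illustrative, not part of the answer):

```r
# Activities of project B from the question (illustrative vector name)
act <- c("kickoff", "kickoff", "funding", "kickoff")

# Running count of funding rows; same values as cumsum() on the logical vector
running_count <- Reduce(`+`, act == "funding", accumulate = TRUE)

# Running maximum: switches to 1 at the first funding row and stays there
running_max <- cummax(act == "funding")

# Negating either vector keeps only the rows before the first funding row
keep <- !running_count
act[keep]
```

In the data.table calls the same logical vector is used to subset .I within each project, which turns it into row numbers of df.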

Upvotes: 2

Uwe

Reputation: 42592

For the sake of completeness, here is also a data.table solution:

library(data.table)
setDT(df)[!df[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1]
   project activity
1:       A  kickoff
2:       B  kickoff
3:       B  kickoff
4:       C  kickoff
5:       C delivery

Explanation

Within each project group, we look for the indices of the first appearance of "funding" in column activity and all subsequent rows:

df[, .I[.I >= first(.I[activity == 'funding'])], by = project]
   project V1
1:       A  2
2:       A  3
3:       B  6
4:       B  7

In data.table, .I is a special symbol which holds the row location in df. The second subsetting .I[.I >= first(.I[activity == 'funding'])] is required because which(.I >= first(.I[activity == 'funding'])) would return only the row locations within the group but not within df.
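The difference between the two forms can be checked side by side. This is a small sketch on the question's data; the object names dt, within_group and within_table are only for illustration:

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  project  = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  activity = c("kickoff", "funding", "delivery", "kickoff", "kickoff",
               "funding", "kickoff", "kickoff", "delivery")
)

# which() reports positions relative to the start of each group
within_group <- dt[, which(.I >= first(.I[activity == "funding"])), by = project]

# Subsetting .I reports positions relative to the whole table
within_table <- dt[, .I[.I >= first(.I[activity == "funding"])], by = project]

within_group$V1  # 2 3 3 4
within_table$V1  # 2 3 6 7
```

Only the second set of numbers can be used to exclude rows from the full table, because 3 and 4 refer to rows inside project B, not to rows 3 and 4 of the table.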

Now we have identified the rows which should not be displayed. Therefore, we get the final result by excluding these row numbers:

df[!df[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1]

If date information is available (and I bet there is a date column when dealing with projects and activities), we can follow a suggestion by @Frank and do an anti non-equi join on the date column:

# create sample data with date column
project<- c('A', 'A', 'A', 'B', 'B', 'B','B', 'C', 'C')
activity<- c('kickoff','funding', 'delivery', 'kickoff','kickoff','funding','kickoff', 'kickoff','delivery')
date <- (as.Date ("2017-10-02") + c(1,4,7,2,5,8,11,3,6))
df <- data.frame(project,activity, date, stringsAsFactors = FALSE)
df <- df[order(df$date), ]
  project activity       date
1       A  kickoff 2017-10-03
4       B  kickoff 2017-10-04
8       C  kickoff 2017-10-05
2       A  funding 2017-10-06
5       B  kickoff 2017-10-07
9       C delivery 2017-10-08
3       A delivery 2017-10-09
6       B  funding 2017-10-10
7       B  kickoff 2017-10-13
# anti non-equi join
setDT(df)[!df[activity == 'funding', first(date), by = project], on = .(project, date >= V1)]
   project activity       date
1:       A  kickoff 2017-10-03
2:       B  kickoff 2017-10-04
3:       B  kickoff 2017-10-07
4:       C  kickoff 2017-10-05
5:       C delivery 2017-10-08
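For readers less familiar with non-equi joins, the same date logic can be spelled out step by step. The following is a rough base R sketch, not the join itself: it looks up the first funding date per project and keeps rows dated strictly before it. The names funding, cutoff, row_cutoff, keep and result are illustrative.

```r
# Sample data with a date column, as constructed above
df <- data.frame(
  project  = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  activity = c("kickoff", "funding", "delivery", "kickoff", "kickoff",
               "funding", "kickoff", "kickoff", "delivery"),
  date     = as.Date("2017-10-02") + c(1, 4, 7, 2, 5, 8, 11, 3, 6),
  stringsAsFactors = FALSE
)

# Earliest funding date per project, as day numbers (tapply drops the Date class)
funding <- df[df$activity == "funding", ]
cutoff  <- tapply(as.numeric(funding$date), funding$project, min)

# Cutoff for every row; projects without any funding row get NA
row_cutoff <- cutoff[df$project]

# Keep rows strictly before the cutoff, and all rows of unfunded projects
keep   <- is.na(row_cutoff) | as.numeric(df$date) < row_cutoff
result <- df[keep, ]
result
```

Unlike the row-index approaches, this does not depend on the order of the rows, only on the dates.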

Upvotes: 2

Abdou

Reputation: 13294

dplyr:

library(dplyr)

df %>%
    group_by(project) %>%
    dplyr::filter(cummin(activity != "funding") == 1)

yields:

# project activity
# <fctr>   <fctr>
# 1       A  kickoff
# 2       B  kickoff
# 3       B  kickoff
# 4       C  kickoff
# 5       C delivery

base R:

do.call(rbind, lapply(split(df, df$project), function(x) {
    x[cummin(x$activity != "funding") == 1, ]
}))

yields:

# project activity
# A       kickoff 
# B       kickoff 
# B       kickoff 
# C       kickoff 
# C       delivery

I hope this helps.

Upvotes: 2

Z.Lin

Reputation: 29125

You can use cumsum to track, within each project, whether a row occurs before or after the first funding:

library(dplyr)

df %>%
  group_by(project) %>%
  mutate(before.funding = cumsum(activity == "funding") == 0) %>%
  ungroup() %>%
  filter(before.funding) %>%
  select(-before.funding)

# A tibble: 5 x 2
  project activity
   <fctr>   <fctr>
1       A  kickoff
2       B  kickoff
3       B  kickoff
4       C  kickoff
5       C delivery
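The same running count can be computed without dplyr. As a sketch, base R's ave() applies cumsum within each project and returns the result in the original row order; the names funding_so_far and result are illustrative:

```r
# Sample data from the question
df <- data.frame(
  project  = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  activity = c("kickoff", "funding", "delivery", "kickoff", "kickoff",
               "funding", "kickoff", "kickoff", "delivery"),
  stringsAsFactors = FALSE
)

# Per-project running count of funding rows, aligned with the rows of df
funding_so_far <- ave(as.integer(df$activity == "funding"), df$project, FUN = cumsum)

# Rows where the count is still zero come before the first funding
result <- df[funding_so_far == 0, ]
result
```

Because ave() keeps the original row order, no split and rbind step is needed and the original row names are preserved.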

Upvotes: 0
