gibbz00
gibbz00

Reputation: 1987

Removing rows of groups based on different row conditions

This is what my dataframe looks like.

dt <- read.table(text='

                         Name      ActivityType   GrpID                 
                         John       Sale           1            
                         John       Sale           2              
                         John       Webinar        3           
                         Kyle       Email          1       
                         Kyle       Seminar        2            
                         Kyle       Sale           3          
                         Kyle       Webinar        4
                         Kyle       Sale           5        
                         Tom        Email          1             
                         Tom        Video          2            
                         Tom        Seminar        3           
                                             ', header=T, row.names = NULL)

I want to do 3 things.

  1. Removing rows where the group's(Name is the group) first ActivityType is "Sale". That removes rows where Name = John
  2. Removing rows where there is no ActivityType = Sale. That removes rows where Name = Tom
  3. Return the remaining group where the first ActivityType is NOT "Sale" BUT there is an ActivityType = "Sale" in some later row(as described in 1 & 2) AND only show the results upto the first instance of ActivityType = Sale. So it should show

                Name      ActivityType   GrpID
                 Kyle       Email          1       
                 Kyle       Seminar        2            
                 Kyle       Sale           3 
    

    It does not have to be 3 steps as described. I just need the final output. I was thinking about using SD function in data.table but dont know how to add those conditions. I will sincerely appreciate your help.

Upvotes: 1

Views: 127

Answers (3)

steveb
steveb

Reputation: 5532

Using dplyr this will work on your example above

dt %>%
    group_by(Name) %>%
    filter( sum((GrpID == 1 & ActivityType=='Sale')) == 0 ) %>%
    filter( sum(ActivityType=='Sale') > 0 ) %>%
    filter( GrpID <= min(GrpID[ActivityType == 'Sale'])) %>%
    ungroup

#Source: local data frame [3 x 3]
#
#    Name ActivityType GrpID
#  (fctr)       (fctr) (int)
#1   Kyle        Email     1
#2   Kyle      Seminar     2
#3   Kyle         Sale     3

There may be a cleaner way to do this though.

EDIT: I added the output and ungroup to remove the grouping.

EDIT 2: Based on the suggestion from MichaelChirico

dt %>%
    group_by(Name) %>%
    filter( !any(ActivityType == 'Sale' & GrpID == 1) )  %>% # 1
    filter(  any(ActivityType == 'Sale') )               %>% # 2
    filter( GrpID <= min(GrpID[ActivityType == 'Sale'])) %>% # 3
    ungroup

The above solution uses any instead of sum (the %>% is the pipe operator). That is not to say this can't be made more efficient though. I will gladly update this again if someone suggests a more efficient and / or more readable dplyr solution.

EDIT 3

Below is an alternate solution for item 3 based on a comment/solution by @MichaelChirico. This combined all 3 conditions in one filter statement (not using the progressive filtering above).

dt %>%
    group_by(Name) %>%
    mutate(x = (ActivityType == 'Sale') ) %>%
    filter( !x[1],
            any(x),
            row_number() <= which.max(x))  %>%
    ungroup %>%
    select(-x)

Upvotes: 3

Rick
Rick

Reputation: 898

# For those who prefer to roll their own
result.list <- by(dt, dt$Name, FUN = function(x) {
  f <- match("Sale", x$ActivityType)
  if(!is.na(f) & (f != 1) ) return(head(x, f))
})

result.df <- do.call(rbind, result.list)

Upvotes: 1

MichaelChirico
MichaelChirico

Reputation: 34703

In data.table:

setDT(dt)

sl <- "Sale" #since we re-use it so much...
#1)
dt[ , if (!ActivityType[1L] == sl) .SD, by = Name]

#2)
dt[ , if (any(ActivityType == sl)) .SD, by = Name]

#3)
dt[ , {x <- ActivityType == sl; if(!x[1] & any(x)) .SD[1:which.max(x)]}, by = Name]

(note that the third case subsumes the first two, so I'm assuming you want three different outputs... otherwise just stick to the last)

Upvotes: 7

Related Questions