Reputation: 601

Subsetting a dataframe based on summation of rows of a given column

I am dealing with data with three variables (i.e. id, time, gender). It looks like

df <-
  structure(
    list(
      id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
      time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
      gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
    ),
    .Names = c("id", "time", "gender"),
    class = "data.frame",
    row.names = c(NA,-12L)
  )

That is, each id has four observations for time and gender. I want to subset this data in R based on the sums of the rows of variable time which first gives a value which is greater than or equal to 25 for each id. Notice that for id 2 all observations will be included and for id 3 only the first observation is involved. The expected results would look like:

df <-
  structure(
    list(
      id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
      time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
      gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
    ),
    .Names = c("id", "time", "gender"),
    class = "data.frame",
    row.names = c(NA,-8L)
  )

Any help on this is highly appreciated.

Upvotes: 3

Answers (3)

MKR

Reputation: 20085

One option is using lag of cumsum as:

library(dplyr)

df %>% group_by(id,gender) %>%
  filter(lag(cumsum(time), default = 0) < 25 )

# # A tibble: 8 x 3
# # Groups: id, gender [3]
# id  time gender
# <int> <int>  <int>
# 1     1    21      1
# 2     1     3      1
# 3     1     4      1
# 4     2     5      0
# 5     2     9      0
# 6     2    10      0
# 7     2     6      0
# 8     3    27      1

Using data.table: (Updated based on feedback from @Renu)

library(data.table)

setDT(df)

df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]

Upvotes: 2

markus

Reputation: 26343

Another option would be to create a logical vector for each 'id', cumsum(time) >= 25, that is TRUE when the cumsum of 'time' is equal to or greater than 25.

Then you can filter for rows where the cumsum of this vector is less or equal then 1, i.e. filter for entries until the first TRUE for each 'id'.

df %>% 
 group_by(id) %>% 
 filter(cumsum( cumsum(time) >= 25 ) <= 1)
# A tibble: 8 x 3
# Groups:   id [3]
#      id  time gender
#   <int> <int>  <int>
# 1     1    21      1
# 2     1     3      1
# 3     1     4      1
# 4     2     5      0
# 5     2     9      0
# 6     2    10      0
# 7     2     6      0
# 8     3    27      1

Upvotes: 1

NT_

Reputation: 641

Can try dplyr construction:

dt <- groupby(df, id) %>%
#sum time within  groups
mutate(sum_time = cumsum(time))%>% 
#'select' rows, which fulfill the condition
filter(sum_time < 25) %>% 
#exclude sum_time column from the result
select (-sum_time)

Upvotes: 1

Subsetting a dataframe based on summation of rows of a given column

Answers (3)

Related Questions