Reputation: 39613

Remove rows in a data frame when the sum over a given column is greater than a threshold

I have a database that has a structure like this:

alfa beta gama z
0001  12a   1  10
0002  13q   1  100
0001  1a    1  70
0001  19p   2  30
0003   2a   2  10

My problem is I don't know how to remove rows that have alfa code of 0001 when the sum of those codes in z column exceeds a value equal to 80, and I would like to save these rows in a new data frame like this:

alfa beta gama z
0001  12a   1  10
0002  13q   1  100
0001  1a    1  70
0003  2a   2  10

Where the sum of 0001 codes got a value less or equal to 80. Thanks.

Upvotes: 1

Answers (3)

Henrik

Reputation: 67818

df2 <- df[df$alfa == 0001, ]
df[-as.numeric(rownames(df2[!cumsum(df2$z) <= 80, ])), ]

Upvotes: 0

Frank

Reputation: 2416

You want to make a new column with the cumsum of z, by alfa, and then subset based on that. There are a few ways to do this, and off the top of my head I can name two:

Here is a data.table solution, using by with := to make the new column, and subsetting based on that column:

library(data.table)
f <- data.table(f)
f[, cum.z := cumsum(z), by = alfa]
new.f <- f[!(alfa == 0001 & cum.z > 80)]

and if you don't want to keep the new variable,

new.f[, cum.z := NULL]

Here is the same thing, implemented with plyr:

library(plyr)
f <- ddply(f, .(alfa), transform,
           cum.z = cumsum(z))
new.f <- f[!(f$alfa == 0001 & f$cum.z > 80),]

and to delete the new column,

new.f$cum.z <- NULL

Upvotes: 1

pabaldonedo

Reputation: 957

You can try this:

#Setting the variables
alfa = c(0001, 0002, 0001, 0001, 0003)
beta = c('12a', '13q', '1a', '19q', '2a')
gama = c(1,1,1,2,2)
z = c(10,100,70,30,10)
f = data.frame(alfa,beta,gama,z)
#actual selection
tmp = f$z
tmp[f$alfa != 0001] = 0
zcum = cumsum(tmp)
index = f$alfa != 0001 | (f$alfa == 0001 & zcum<= 80)
new_frame = f[index,]

Upvotes: 0

Remove rows in a data frame when the sum over a given column is greater than a threshold

Answers (3)

Related Questions