Chaitu
Chaitu

Reputation: 151

Counting the instances of a variable that exceeds a threshold

I have a dataset with id and speed.

id <- c(1,1,1,1,2,2,2,2,3,3,3)

speed <- c(40,30,50,40,45,50,30,55,50,50,60)

i <- cbind(id, speed)

limit <- 35

Say, if 'speed' crosses 'limit' will count it as 1. And you will count again only if speed comes below and crosses the 'limit'.

I want data to be like.

 id | Speed Viol.
 ----------
 1  |   2
 ---------
 2  |   2
 ---------
 3  |   1
 ---------

here id (count).

 id1 (1) 40 (2) 50,40

 id2 (1) 45,50 (2) 55

 id3 (1) 50,50,60

How to do it not using if().

Upvotes: 1

Views: 811

Answers (4)

d.b
d.b

Reputation: 32548

aggregate(speed~id, data.frame(i), function(x) sum(rle(x>limit)$values))
#  id speed
#1  1     2
#2  2     2
#3  3     1

The main idea is that x > limit will check for instances when the speed limit is violated and rle(x) will group those instances into consecutive violations or consecutive non-violations. Then all you need to do is to count the groups of consecutive violations (when rle(x>limit)$values is TRUE).

Upvotes: 0

lmo
lmo

Reputation: 38510

Here's a method tapply as suggested in the comments and the original vectors.

tapply(speed, id, FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
1 2 3 
2 2 1 

tapply applies a function to each group, here, by ID. The function checks if the first element of an ID is over 35, and then concatenates this to the output of diff, whose argument is checking if subsequent observations are greater than 35. Thus diff checks if an ID returns to above 35 after dropping below that level. Negative values in the resulting vector are converted to FALSE (0) with > 0 and these results are summed.

tapply returns a named vector, which can be fairly nice to work with. However, if you want a data.frame, then you could use aggregate instead as suggested by d.b:

aggregate(speed, list(id=id), FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
  id x
1  1 2
2  2 2
3  3 1

Upvotes: 5

Dan
Dan

Reputation: 12074

Here's a dplyr solution. I group by id then check if speed is above the limit in each row, but wasn't in the previous entry. (I get the previous row using lag). If this is the case, it produces TRUE. Or, if it's the first row for the id (i.e., row_number()==1) and it's above the limit, this gives a TRUE, too. Then, I sum all the TRUE values for each id using summarise.

id <- c(1,1,1,1,2,2,2,2,3,3,3)   
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- data.frame(id, speed)
limit <- 35

library(dplyr)

i %>% 
  group_by(id) %>% 
  mutate(viol=(speed>limit&lag(speed)<limit)|(row_number()==1&speed>limit)) %>% 
  summarise(sum(viol))

# A tibble: 3 x 2
     id `sum(viol)`
  <dbl>       <int>
1     1           2
2     2           2
3     3           1

Upvotes: 2

Sotos
Sotos

Reputation: 51592

Here is another option with data.table,

library(data.table)

setDT(i)[, id1 := rleid(speed > limit), by = id][
          speed > limit, .(violations = uniqueN(id1)), by = id][]

which gives,

   id violations
1:  1          2
2:  2          2
3:  3          1

Upvotes: 1

Related Questions