ew23

Reputation: 3

Subsetting by multiple aggregate conditions in dplyr

I was hoping someone knew of an easy/efficient way in dplyr to define an indicator variable that takes the value 1 if, on a given date, an IP address appears more than 50 times. The data has two columns: one of IP addresses and one of the associated access dates.

As an example, I would like the following output in the Robot column (using a threshold of >= 3 occurrences per Date/IP combination for this small example).

IP  Date  Robot
1   A     1
1   A     1
1   A     1
1   B     0
2   B     0
2   C     1
2   C     1
2   C     1
3   C     0
3   D     0
4   A     0

Thanks!

Upvotes: 0

Views: 107

Answers (3)

akrun

Reputation: 887148

We could use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1). Then, grouped by "IP" and "Date", create "Robot" by converting the logical vector (.N >= 3) to a binary representation. The conversion can be done with a unary + on the logical vector or with the function as.integer.

library(data.table)
setDT(df1)[, Robot:= +(.N>=3), .(IP, Date)]

The unary + can be replaced by as.integer with the same result.
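A quick base R check of that coercion, as a minimal sketch. The vector `counts` is made up for illustration and is not part of the question's data.

```r
# Hypothetical group sizes, for illustration only
counts <- c(3L, 1L, 5L)

# Unary plus coerces the logical vector to integer
flag_plus <- +(counts >= 3)

# Explicit conversion of the same logical vector
flag_cast <- as.integer(counts >= 3)

# Both routes give the same integer vector
same <- identical(flag_plus, flag_cast)
print(same)
```

The explicit as.integer form is arguably easier to read for someone unfamiliar with the unary + idiom.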


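Or with base R, we can use ave. Passing seq_along(IP) as the first argument keeps the count numeric even when IP is stored as character or factor:

transform(df1, Robot = as.integer(ave(seq_along(IP), IP, Date, FUN = length) >= 3))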
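As a sketch of the ave approach on the question's sample data, the steps can be split out as below. Using seq_along(df1$IP) as the first argument is an assumption made here so that the count does not depend on the type of the IP column; `group_size` is a name introduced for illustration.

```r
# Sample data copied from the question (IP and Date columns only)
df1 <- data.frame(
  IP   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4),
  Date = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "A")
)

# Size of each IP/Date group, repeated on every row of that group
group_size <- ave(seq_along(df1$IP), df1$IP, df1$Date, FUN = length)

# Flag rows whose IP/Date pair occurs at least 3 times
df1$Robot <- as.integer(group_size >= 3)
print(df1)
```

For the real data, the comparison would use the question's threshold of 50 instead of 3.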

Upvotes: 0

Alexander Radev

Reputation: 662

For efficiency, the same logic in data.table:

library(data.table)

DT <- fread("IP Date
            1   A   
            1   A   
            1   A   
            1   B   
            2   B   
            2   C   
            2   C   
            2   C   
            3   C   
            3   D   
            4   A")

DT[, Robot := ifelse(.N >= 3, 1, 0), keyby = .(IP, Date)]

Of course, you need to change the condition when you want 50 to be the threshold: .N > 50 for more than 50 occurrences, as in the question, or .N >= 50 for at least 50.
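One way to make the threshold easy to change is to keep it in a variable, sketched below under the assumption that data.table is installed. The name `threshold` is hypothetical, and `by` is used instead of `keyby` so the original row order is kept.

```r
library(data.table)

# Sample data copied from the question
DT <- data.table(
  IP   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4),
  Date = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "A")
)

# Hypothetical name; 3 matches the small example, use 50 for the real data
threshold <- 3

# .N is the number of rows in the current IP/Date group
DT[, Robot := as.integer(.N >= threshold), by = .(IP, Date)]
print(DT)
```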

Upvotes: 0

scoa

Reputation: 19867

You can group_by the two variables and use n() to test how many times an address was present that day.

library(dplyr)
# Count rows per Date/IP pair and flag pairs seen more than 50 times
group_by(df, Date, IP) %>%
  mutate(Robot = as.numeric(n() > 50))
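As a sketch of this grouped count on the question's sample data, assuming the columns are named IP and Date as in the question and using the small example's threshold of 3 (the names `threshold` and `result` are introduced here for illustration):

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  IP   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4),
  Date = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "A")
)

# Hypothetical name; the question's real data would use n() > 50 instead
threshold <- 3

result <- df %>%
  group_by(IP, Date) %>%                           # one group per IP/Date pair
  mutate(Robot = as.integer(n() >= threshold)) %>% # n() is the group size
  ungroup()                                        # drop grouping for later steps

print(result)
```

The ungroup() call is optional, but it avoids later operations being applied per group by accident.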

Upvotes: 4
