Reputation: 3
I was hoping someone knew of an easy/efficient way in dplyr to define an indicator variable that takes the value 1 if, on Date X, an IP address was present >50 times. The data has two columns: one of IP addresses and the other of the associated access dates.
As an example, I would like the following output in the Robot column (using a threshold of >=3 occurrences of the Date/IP combination for illustration).
IP Date Robot
1 A 1
1 A 1
1 A 1
1 B 0
2 B 0
2 C 1
2 C 1
2 C 1
3 C 0
3 D 0
4 A 0
Thanks!
Upvotes: 0
Views: 107
Reputation: 887148
We could use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), then, grouped by "IP" and "Date", create the "Robot" column by converting the logical .N>=3 to its binary representation. This can be done either by applying a unary + to the logical vector or by using the function as.integer.
library(data.table)
setDT(df1)[, Robot:= +(.N>=3), .(IP, Date)]
The + can be replaced by as.integer
Or with base R, we can use ave
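# Counting over seq_along(IP) keeps the result numeric even if IP is stored as character
transform(df1, Robot = as.integer(ave(seq_along(IP), IP, Date, FUN = length) >= 3))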
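As a minimal sketch of the base R approach on the sample data from the question, the group sizes can be computed first and then compared with the threshold. The names `threshold` and `n_per_group` are only illustrative, and IP is stored as character here on the assumption that real IP addresses would be text.

```r
# Sample data from the question; IP kept as character to mimic real addresses
df1 <- data.frame(
  IP   = c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "4"),
  Date = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "A")
)
threshold <- 3  # illustrative name; use the real cutoff for the full data

# Number of rows in each IP/Date group, repeated for every row of that group
n_per_group <- ave(seq_along(df1$IP), df1$IP, df1$Date, FUN = length)

# 1 when the group reaches the threshold, 0 otherwise
df1$Robot <- as.integer(n_per_group >= threshold)
```

Counting over a row index rather than over the IP column itself means the counts stay numeric, so the comparison with the threshold is not done on strings.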
Upvotes: 0
Reputation: 662
For efficiency, the same logic in data.table:
library(data.table)
DT <- fread("IP Date
1 A
1 A
1 A
1 B
2 B
2 C
2 C
2 C
3 C
3 D
4 A")
DT[, Robot := ifelse(.N >= 3, 1, 0), keyby = .(IP, Date)]
Of course, you need to change the condition to .N >= 50 when you want 50 to be the threshold.
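One way to make the threshold easy to change is to keep it in a variable. This is only a sketch: the name `threshold` is an assumption, it is set to 3 so the result can be checked against the sample data, and `by` is used instead of `keyby`.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  IP   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4),
  Date = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "A")
)
threshold <- 3  # illustrative name; set to 50 for the real data

# .N is the number of rows in the current IP/Date group
DT[, Robot := as.integer(.N >= threshold), by = .(IP, Date)]
```

This assumes the data has no column called `threshold`; if it did, the column would be picked up inside the brackets instead of the variable.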
Upvotes: 0
Reputation: 19867
You can group_by the two variables and use n() to test how many times each address was present that day.
library(dplyr)
# Flag every row of an IP/Date group that occurs more than 50 times
group_by(df, Date, IP) %>%
  mutate(Robot = as.numeric(n() > 50))
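For a version that can be checked against the expected output in the question, here is a small sketch on the sample data. It assumes the columns are named IP and Date as in the question, uses an illustrative `threshold` of 3 with >= as in the example table, and stores the result in a hypothetical `result` object.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  IP   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4),
  Date = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "A")
)
threshold <- 3  # illustrative name; the question's real cutoff is "more than 50"

result <- df %>%
  group_by(IP, Date) %>%                          # one group per IP/Date pair
  mutate(Robot = as.integer(n() >= threshold)) %>%  # n() is the group size
  ungroup()                                       # drop grouping for later steps
```

The ungroup() at the end is optional; it just avoids later operations being applied per group by accident.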
Upvotes: 4