Reputation: 789
I have a dataset that has the IDs of certain individuals that tried to breach a private network connection, there are only two types of breaches "X" and "Y" and the dataset will let you know if the attempt was successful (1) or unsuccessful (0) I have to compute how many times a person has tried to breach the network UP TO their first successful breach of type X and/or Y.
To illustrate I have to do my analysis as follows:
I have to organize by date in ascending order then count how many times an ID tried to breach the network until this ID had a successful breach and whether it was of type X or type Y such as:
ID DATE SUCCESS TYPE
007 1/01/2021 0 X
007 1/01/2021 0 X
007 1/01/2021 1 X
007 2/01/2021 0 X
007 3/01/2021 1 Y
In this table we see that agent 007 made chronological 3 attempts before his FIRST successfully type X breach of the network, and that it only took him one attempt to do a type Y breach of the network
I have been hitting my head against a wall trying to come up with a code that would let me do that calculation for the following dataset:
library(tidyverse)
dataset<- tibble::tribble(
~DATE, ~ID, ~SUCCESS, ~TYPE,
"19/04/2021", "52952", "0", "X",
"19/04/2021", "46102", "0", "X",
"20/04/2021", "40005", "1", "X",
"20/04/2021", "29872", "0", "X",
"21/04/2021", "52952", "0", "X",
"21/04/2021", "46102", "1", "Y",
"22/04/2021", "46102", "0", "X",
"22/04/2021", "46102", "1", "Y",
"23/04/2021", "46102", "0", "Y",
"23/04/2021", "46102", "1", "Y",
"24/04/2021", "46102", "0", "Y",
"24/04/2021", "40005", "0", "Y",
"25/04/2021", "52952", "0", "Y",
"25/04/2021", "46102", "1", "Y",
"26/04/2021", "40005", "0", "X",
"26/04/2021", "52952", "1", "X",
"27/04/2021", "46102", "1", "X",
"27/04/2021", "40005", "1", "X",
"28/04/2021", "52952", "1", "X",
"28/04/2021", "52952", "0", "X",
)
What have I done?
I first started by grouping the ID and DATES in order to select the very first attempt of type X and Y that way I know when was the very first attempt giving that this is Zero, and then I have been figuring out a way to select for each ID the "nearest" successful attempt without any luck as I always end up selecting the very last successful attempt but this excessive only calls for the very first attempt... Also, I have to be careful when selecting the first event because there may be cases where the very first attempt is a success.
Upvotes: 0
Views: 49
Reputation: 30474
If you group_by
both ID
and TYPE
and have them arranged by DATE
, you can get the index of the first row where SUCCESS
is equal to 1.
Note - you'll have warnings and Inf
for events where there is no SUCCESS
of 1 for a given ID
and TYPE
.
library(tidyverse)
dataset %>%
mutate(DATE = as.Date(DATE, format = "%d/%m/%Y")) %>%
group_by(ID, TYPE) %>%
arrange(ID, TYPE, DATE) %>%
summarise(EVENTS = min(which(SUCCESS == 1)))
Both min
and which
are base R. Consider the following data subset for 52952 and type "X":
2021-04-19 52952 0 X
2021-04-21 52952 0 X
2021-04-26 52952 1 X
2021-04-28 52952 1 X
2021-04-28 52952 0 X
The values for success are 0, 0, 1, 1, 0 when ordered by date.
If you have a vector of those values, which
will tell you the index or position, where SUCCESS
is equal to 1. To demonstrate, if I create a vector vec
with those same values:
vec <- c(0, 0, 1, 1, 0)
vec
[1] 0 0 1 1 0
which(vec == 1)
[1] 3 4
then which
tells me that indexes/positions 3 and 4 (third and fourth value) satisfy the condition of equaling 1.
Then, min
is added to take the minimum value. So, for the result from which
(3 and 4), the minimum value is 3. It follows then that 3 events are needed to reach the first SUCCESS
value that equals 1.
Output
# A tibble: 7 x 3
# Groups: ID [4]
ID TYPE EVENTS
<chr> <chr> <dbl>
1 29872 X Inf
2 40005 X 1
3 40005 Y Inf
4 46102 X 3
5 46102 Y 1
6 52952 X 3
7 52952 Y Inf
Upvotes: 2
Reputation: 655
dataset$DATE <- lubridate::dmy(dataset$DATE)
dataset %>%
arrange(ID, TYPE, DATE) %>%
group_by(ID, TYPE) %>%
# creating variable to denote when success status changes
mutate(RLEID = rleid(SUCCESS == 0)) %>%
# keeping the row if the attempt was unsuccessful (SUCCESS == 0) and before a successful attempt (RLEID == 1)
filter(SUCCESS == 0 & RLEID == 1) %>%
# finding the counts
summarise(n = n())
Some people (ID = 29872 TYPE = X, ID = 40005 TYPE = Y, ID = 52952 TYPE = Y) never had a successful attempt. Did you still want to include them?
Upvotes: 1