Reputation: 484
I have a log file converted to data frame with data like this:
Client Date Group Count Test
C1 1.437512e+12 6 2 0
C1 1.437685e+12 3 1 0
C2 1.437685e+12 3 1 0
C2 1.437772e+12 6 1 0
C2 1.438117e+12 6 2 0
C2 1.464037e+12 3 3 0
C3 1.448662e+12 6 2 0
C3 1.451081e+12 3 5 0
C4 1.437944e+12 6 1 0
C4 1.438895e+12 3 6 0
C5 1.460581e+12 3 2 0
C5 1.460668e+12 6 2 0
C5 1.460927e+12 6 1 0
C5 1.461013e+12 3 2 0
C6 1.437685e+12 3 1 0
C6 1.437944e+12 6 1 0
C6 1.447711e+12 3 2 0
C6 1.458079e+12 3 2 0
C7 1.463000e+12 3 5 0
C7 1.463000e+12 10 1 0
C8 1.463951e+12 6 5 0
C8 1.463951e+12 3 1 0
C9 1.463346e+12 3 5 0
C9 1.464037e+12 10 1 0
C10 1.459804e+12 3 2 0
C10 1.461272e+12 3 1 0
C10 1.461877e+12 5 1 0
C10 1.462223e+12 5 1 0
C10 1.462482e+12 5 1 0
Client is the ID of a person in the database. Group is that person's category at a given moment in time (rows are ordered from oldest to newest). Count is the number of events made by the user on that day. Test is a column that I want to fill with values.
My main goal is to get, for every user, only those rows from this data frame that have Group = 3 before the user's first change to some number other than 3. I want to mark these rows in the Test column and then sum the Count values of these rows for each user.
I tried to solve this problem with a for loop:
for (i in 2:length(log$Group)) {
  if (log$Client[i-1] == log$Client[i]) {
    if ((log$Group[i-1] == 3) & (log$Group[i] != 3)) {
      log$Test[i] <- NA
    }
    if ((log$Group[i-1] != 3) & (log$Group[i] == 3)) {
      log$Test[i] <- NA
    }
    if ((log$Group[i-1] == 3) & (log$Group[i] == 3)) {
      if (is.na(log$Test[i-1])) {
        log$Test[i] <- NA
      }
    }
    if ((log$Group[i-1] != 3) & (log$Group[i] != 3)) {
      log$Test[i] <- NA
      log$Test[i-1] <- NA
    }
  }
}
But it takes ages to run this loop on the full data. Afterwards I would just delete all rows with NAs.
The expected result looks like this:
Client Date Group Count Test
C2 1.437685e+12 3 1 1
C5 1.460581e+12 3 2 1
C6 1.437685e+12 3 1 1
C7 1.463000e+12 3 5 1
C9 1.463346e+12 3 5 1
C10 1.459804e+12 3 2 1
C10 1.461272e+12 3 1 1
Is there any way to do all of this without loops? Or is there perhaps a better way to solve the entire problem?
Upvotes: 0
Views: 1085
Reputation: 131
If I understood your question correctly (and it is very difficult to understand), you should do something like the following (using the dplyr package):
library(dplyr)

# Column names are case-sensitive: Client, Group, Count
eventsToAnalyze %>%
  filter(Group == 3) %>%
  group_by(Client) %>%
  summarize(total = sum(Count))
This will filter out the rows not in group 3 and sum the values in the Count column by client.
EDIT
Ah, now that you've clarified your question, I see the problem. This should work if 3 is the lowest value, because the running maximum of Group then stays at 3 until the first change to another group:
eventsToAnalyze %>%
  group_by(Client) %>%
  filter(cummax(Group) <= 3) %>%
  summarize(total = sum(Count))
Otherwise you can use the messier but more general:
eventsToAnalyze %>%
  group_by(Client) %>%
  # abs(Group - 3) is 0 only while Group is still 3
  filter(cummax(abs(Group - 3)) <= 0) %>%
  summarize(total = sum(Count))
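To see the general filter at work, here is a minimal sketch run on a subset of the sample data from the question. The data frame name `events` is only a stand-in for illustration, the Date and Test columns are left out for brevity, and the rows are assumed to be ordered from oldest to newest within each client, as in the question.

```r
library(dplyr)

# Hypothetical name `events`; values copied from the question for C1, C2, C5 and C10
events <- data.frame(
  Client = c("C1", "C1",
             "C2", "C2", "C2", "C2",
             "C5", "C5", "C5", "C5",
             "C10", "C10", "C10", "C10", "C10"),
  Group  = c(6, 3,
             3, 6, 6, 3,
             3, 6, 6, 3,
             3, 3, 5, 5, 5),
  Count  = c(2, 1,
             1, 1, 2, 3,
             2, 2, 1, 2,
             2, 1, 1, 1, 1)
)

totals <- events %>%
  group_by(Client) %>%
  # Keep rows only while every Group seen so far for this client equals 3
  filter(cummax(abs(Group - 3)) == 0) %>%
  summarize(total = sum(Count))

print(totals)
```

On this subset the totals should come out as 1 for C2, 2 for C5 and 3 for C10, which matches the rows listed in the question's expected result. C1 starts in group 6, so none of its rows pass the filter and it does not appear in the summary at all.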
Upvotes: 1