mrbubu

Reputation: 484

How to compare values from current and previous rows based on other values in data frame without loops in R

I have a log file converted to a data frame, with data like this:

Client  Date    Group   Count   Test
C1  1.437512e+12    6   2   0
C1  1.437685e+12    3   1   0
C2  1.437685e+12    3   1   0
C2  1.437772e+12    6   1   0
C2  1.438117e+12    6   2   0
C2  1.464037e+12    3   3   0
C3  1.448662e+12    6   2   0
C3  1.451081e+12    3   5   0
C4  1.437944e+12    6   1   0
C4  1.438895e+12    3   6   0
C5  1.460581e+12    3   2   0
C5  1.460668e+12    6   2   0
C5  1.460927e+12    6   1   0
C5  1.461013e+12    3   2   0
C6  1.437685e+12    3   1   0
C6  1.437944e+12    6   1   0
C6  1.447711e+12    3   2   0
C6  1.458079e+12    3   2   0
C7  1.463000e+12    3   5   0
C7  1.463000e+12    10  1   0
C8  1.463951e+12    6   5   0
C8  1.463951e+12    3   1   0
C9  1.463346e+12    3   5   0
C9  1.464037e+12    10  1   0
C10 1.459804e+12    3   2   0
C10 1.461272e+12    3   1   0
C10 1.461877e+12    5   1   0
C10 1.462223e+12    5   1   0
C10 1.462482e+12    5   1   0

Client is the ID of a person in the database. Group is the category of this person at a given moment in time (rows are ordered from oldest to newest). Count is the number of events made by the user on that day. Test is a column that I want to fill with values.

My main goal is to get, for every user, only those rows of this data frame that have Group = 3 before the user's first change to some value other than 3. I want to mark these rows in the Test column and then sum the Count values of these rows for each user.

I tried to solve this problem with a for loop:

for (i in 2:length(log$Group)){
 if (log$Client[i-1] == log$Client[i]) {
  if ((log$Group[i-1] == 3) & (log$Group[i] != 3)) {
   log$Test[i] <- NA
  }
  if ((log$Group[i-1] != 3) & (log$Group[i] == 3)) {
   log$Test[i] <- NA
  }
  if ((log$Group[i-1] == 3) & (log$Group[i] == 3)) {
   if(is.na(log$Test[i-1])) {
    log$Test[i] <- NA
   }
  }
  if ((log$Group[i-1] != 3) & (log$Group[i] != 3)) {
   log$Test[i] <- NA
   log$Test[i-1] <- NA
  }
 }
}

But it takes ages to run this loop on the full data. Afterwards I would just delete all rows with NAs.

The expected result looks like this:

Client  Date    Group   Count   Test
C2  1.437685e+12    3   1   1
C5  1.460581e+12    3   2   1
C6  1.437685e+12    3   1   1
C7  1.463000e+12    3   5   1
C9  1.463346e+12    3   5   1
C10 1.459804e+12    3   2   1
C10 1.461272e+12    3   1   1

Is there any way to do all of this without loops? Or is there a better way to solve the entire problem?

Upvotes: 0

Views: 1085

Answers (1)

Luis Usier

Reputation: 131

If I understood your question correctly (and it is very difficult to understand), you should do something like the following (using package dplyr):

library(dplyr)

eventsToAnalyze %>%
    filter(Group == 3) %>%
    group_by(Client) %>%
    summarize(total = sum(Count))

This will filter out the rows not in group 3 and sum the values in the Count column by client.

EDIT

Ah, now that you've clarified your question, I see the problem. This should work if 3 is the lowest possible Group value and the rows are ordered by date within each client:

eventsToAnalyze %>%
    group_by(Client) %>%
    filter(cummax(Group) <= 3) %>%
    summarize(total = sum(Count))
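To see why the cummax filter keeps only the leading rows, here is a minimal sketch on a single made-up vector (the values are hypothetical, not taken from the question's data). It assumes 3 is the smallest value that can occur:

```r
# Group values for one hypothetical client, ordered from oldest to newest.
group <- c(3, 3, 5, 3, 5)

# Running maximum: once a value above 3 appears, it never drops back.
running_max <- cummax(group)   # 3 3 5 5 5

# TRUE only for the leading run of 3s; the later 3 is excluded.
keep <- running_max <= 3       # TRUE TRUE FALSE FALSE FALSE
```

Because the running maximum never decreases, a 3 that appears after the first change is not picked up again.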

Otherwise you can use the messier but more general:

eventsToAnalyze %>%
    group_by(Client) %>%
    filter(cummax(abs(Group - 3)) <= 0) %>%
    summarize(total = sum(Count))
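If you also want the Test column filled in, as the question asks, the same idea can be sketched in base R without dplyr. This is only an illustration under some assumptions: the data frame name log_df is made up (it avoids shadowing the base function log), the sample is a subset of the question's rows with Date left out, and the rows are assumed to be already ordered from oldest to newest within each client.

```r
# Small sample in the question's layout (subset of its rows, Date omitted).
log_df <- data.frame(
  Client = c("C1", "C1", "C2", "C2", "C2", "C2",
             "C10", "C10", "C10", "C10", "C10"),
  Group  = c(6, 3, 3, 6, 6, 3, 3, 3, 5, 5, 5),
  Count  = c(2, 1, 1, 1, 2, 3, 2, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Within each client, count how many non-3 rows have been seen so far.
# The count stays 0 only for the leading run of Group == 3.
non3_seen <- ave(as.integer(log_df$Group != 3), log_df$Client, FUN = cumsum)

# Mark the rows of interest: 1 for leading Group == 3 rows, NA otherwise.
log_df$Test <- ifelse(non3_seen == 0, 1, NA)

# Keep the marked rows and sum Count per client.
marked <- log_df[!is.na(log_df$Test), ]
totals <- aggregate(Count ~ Client, data = marked, FUN = sum)
```

On this sample, C1 drops out entirely because it starts in group 6, C2 keeps its first row, and C10 keeps its first two rows. Note that a client whose Group never leaves 3 would keep all of its rows here; whether that is wanted depends on how "before its first change" is meant.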

Upvotes: 1
