Cina
Cina

Reputation: 10199

looping and check condition of consecutive records and replacing in R

I have dataset consists of user,time and condition. I want to replace the time for sequences beginning with a FALSE followed more than two consecutive TRUEs with the time of the last consecutive TRUE.

let's say df:

df <- read.csv(text="user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header=TRUE)

my desired result: time of rownumber 6 is copied to time of rownumber 3 to 6 because consecutive TRUE starts from 4 to 6. the same applies to last three records.

user time   condition
11  1:05    FALSE
11  1:10    TRUE
11  1:25    FALSE
11  1:25    TRUE
11  1:25    TRUE
11  1:25    TRUE
11  1:40    FALSE
22  2:20    FALSE
22  2:40    FALSE
22  2:40    TRUE
22  2:40    TRUE

how can I do it in R?

Upvotes: 4

Views: 329

Answers (3)

chinsoon12
chinsoon12

Reputation: 25225

Here is a data.table solution which should be faster in runtime.

library(data.table)
setDT(df)
df[, time := if (.N > 2) time[.N] else time, 
    by=cumsum(!shift(c(condition, FALSE))[-1L])]

#    user time condition
# 1:   11 1:05     FALSE
# 2:   11 1:10      TRUE
# 3:   11 1:25     FALSE
# 4:   11 1:25      TRUE
# 5:   11 1:25      TRUE
# 6:   11 1:25      TRUE
# 7:   11 1:40     FALSE
# 8:   22 2:20     FALSE
# 9:   22 2:40     FALSE
#10:   22 2:40      TRUE
#11:   22 2:40      TRUE

Idea is to cut into sequences of starting with F.

[-1L] removes the first NA before doing a cumsum.

I would recommend that you run some of the by code within j to take a look.

data:

df <- read.csv(text="user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header=TRUE)

Upvotes: 0

Gabe
Gabe

Reputation: 649

Here's one option using rle

## Run length encoding of df
df_rle <- rle(df$condition)
## Locations of 2 or more consecutive TRUEs in RLE
seq_changes <- which(df_rle$lengths >= 2 & df_rle$value == TRUE)
## End-point index in original data frame
df_ind <- cumsum(df_rle$lengths)

## Loop over breakpoints to change
for (i in seq_changes){
  i1 <- df_ind[i-1]
  i2 <- df_ind[i]
  df$time[i1:i2] <- df$time[i2]
}

Upvotes: 4

LachlanO
LachlanO

Reputation: 1162

This solution should do the trick, see the comments in the code for more details

false_positions <- which(!c(df$condition, FALSE)) #Flag the position of each of the FALSE occurences
                                                  #A dummy FALSE is put on the end to check for end of dataframe

false_differences <- diff(false_positions, 1)     #Calculate how far each FALSE occurence is from the last

false_starts <- which(false_differences > 2)      #Grab which of these FALSE differences are more than 2 apart
                                                  #Greater than 2 indicates 2 or more TRUEs as the first FALSE 
                                                  #counts as one position

#false_starts stores the beginning of each chain we want to update

#Go through each of the FALSE starts which have more than one consecutive TRUE
for(false_start in false_starts){

  false_first <- false_positions[false_start]     #Gets the position of the start of our chain

  true_last <- false_positions[false_start+1]-1   #Gets the position of the end of our chain, which is the
                                                  #the item before (thus the -1) the false after our
                                                  #initial FALSE (thus the +1)

  time_override <- df$time[true_last]             #Now we know the position of the end of our chain (the last TRUE)
                                                  #We can get the time we want to use

  df$time[false_first:true_last] <- time_override #Update all the times from the start to end of our chain with
                                                  #the time we just determined

}

> df
   user time condition
1    11 1:05     FALSE
2    11 1:10      TRUE
3    11 1:25     FALSE
4    11 1:25      TRUE
5    11 1:25      TRUE
6    11 1:25      TRUE
7    11 1:40     FALSE
8    22 2:20     FALSE
9    22 2:40     FALSE
10   22 2:40      TRUE
11   22 2:40      TRUE

I would like to parallelize that bottom loop if possible, but off the top of my head I was struggling to do so.

The gist is to identify where all our falses are, then identify where the start of all our chains are, since we only have TRUEs and FALSEs we can do this by looking at how far apart our FALSEs are!

Once we know where our chains start (since they are the first FALSE where FALSEs are far enough apart) we can get the end of our chain by looking at the element before the next FALSE in the list of all FALSES we already created.

Now we have the beginning and end of our chains we can just look at the end of the chain to get the time we want, then fill out the time values!

I hope this presents a relatively fast way of doing what you want, though :)

Upvotes: 1

Related Questions