Reputation: 10199
I have dataset consists of user
,time
and condition
. I want to replace the time for sequences beginning with a FALSE followed more than two consecutive TRUE
s with the time
of the last consecutive TRUE
.
let's say df:
df <- read.csv(text="user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header=TRUE)
my desired result: time of rownumber 6 is copied to time of rownumber 3 to 6 because consecutive TRUE
starts from 4 to 6. the same applies to last three records.
user time condition
11 1:05 FALSE
11 1:10 TRUE
11 1:25 FALSE
11 1:25 TRUE
11 1:25 TRUE
11 1:25 TRUE
11 1:40 FALSE
22 2:20 FALSE
22 2:40 FALSE
22 2:40 TRUE
22 2:40 TRUE
how can I do it in R?
Upvotes: 4
Views: 329
Reputation: 25225
Here is a data.table
solution which should be faster in runtime.
library(data.table)
setDT(df)
df[, time := if (.N > 2) time[.N] else time,
by=cumsum(!shift(c(condition, FALSE))[-1L])]
# user time condition
# 1: 11 1:05 FALSE
# 2: 11 1:10 TRUE
# 3: 11 1:25 FALSE
# 4: 11 1:25 TRUE
# 5: 11 1:25 TRUE
# 6: 11 1:25 TRUE
# 7: 11 1:40 FALSE
# 8: 22 2:20 FALSE
# 9: 22 2:40 FALSE
#10: 22 2:40 TRUE
#11: 22 2:40 TRUE
Idea is to cut into sequences of starting with F.
[-1L]
removes the first NA before doing a cumsum
.
I would recommend that you run some of the by
code within j
to take a look.
data:
df <- read.csv(text="user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header=TRUE)
Upvotes: 0
Reputation: 649
Here's one option using rle
## Run length encoding of df
df_rle <- rle(df$condition)
## Locations of 2 or more consecutive TRUEs in RLE
seq_changes <- which(df_rle$lengths >= 2 & df_rle$value == TRUE)
## End-point index in original data frame
df_ind <- cumsum(df_rle$lengths)
## Loop over breakpoints to change
for (i in seq_changes){
i1 <- df_ind[i-1]
i2 <- df_ind[i]
df$time[i1:i2] <- df$time[i2]
}
Upvotes: 4
Reputation: 1162
This solution should do the trick, see the comments in the code for more details
false_positions <- which(!c(df$condition, FALSE)) #Flag the position of each of the FALSE occurences
#A dummy FALSE is put on the end to check for end of dataframe
false_differences <- diff(false_positions, 1) #Calculate how far each FALSE occurence is from the last
false_starts <- which(false_differences > 2) #Grab which of these FALSE differences are more than 2 apart
#Greater than 2 indicates 2 or more TRUEs as the first FALSE
#counts as one position
#false_starts stores the beginning of each chain we want to update
#Go through each of the FALSE starts which have more than one consecutive TRUE
for(false_start in false_starts){
false_first <- false_positions[false_start] #Gets the position of the start of our chain
true_last <- false_positions[false_start+1]-1 #Gets the position of the end of our chain, which is the
#the item before (thus the -1) the false after our
#initial FALSE (thus the +1)
time_override <- df$time[true_last] #Now we know the position of the end of our chain (the last TRUE)
#We can get the time we want to use
df$time[false_first:true_last] <- time_override #Update all the times from the start to end of our chain with
#the time we just determined
}
> df
user time condition
1 11 1:05 FALSE
2 11 1:10 TRUE
3 11 1:25 FALSE
4 11 1:25 TRUE
5 11 1:25 TRUE
6 11 1:25 TRUE
7 11 1:40 FALSE
8 22 2:20 FALSE
9 22 2:40 FALSE
10 22 2:40 TRUE
11 22 2:40 TRUE
I would like to parallelize that bottom loop if possible, but off the top of my head I was struggling to do so.
The gist is to identify where all our falses are, then identify where the start of all our chains are, since we only have TRUEs and FALSEs we can do this by looking at how far apart our FALSEs are!
Once we know where our chains start (since they are the first FALSE where FALSEs are far enough apart) we can get the end of our chain by looking at the element before the next FALSE in the list of all FALSES we already created.
Now we have the beginning and end of our chains we can just look at the end of the chain to get the time we want, then fill out the time values!
I hope this presents a relatively fast way of doing what you want, though :)
Upvotes: 1