Reputation: 31
Within a dataframe in R, I'm trying to detect whether 5 consecutive observations are identical in a pseudorandom design.
In this design, two classes of objects are presented 5 times (A objects and B objects) and participants make a rating of them (A1 - B5). The ordering of each object is contained in columns as well (A1.order - B5.order). What I'd like to know is whether 5 consecutive observations (can be the first 5, middle 5, last 5, etc.) are identical. The typical functions for detecting whether columns are identical don't work here because I am concerned about the presentation order of the objects.
I'd like for a new column called "identical" to be added to the dataframe which will give me a TRUE or FALSE as to whether there are 5 identical consecutive observations.
I'd also like for those identical observations to be coded as "NA"
df <- structure(list(ID = c(101, 102, 103, 104, 105, 106, 107), A1 = c(1,
4, 1, 3, 4, 5, 3), A2 = c(1, 3, 1, 1, 5, 3, 2), A3 = c(1, 1,
1, 5, 4, 3, 2), A4 = c(1, 3, 2, 1, 3, 2, 1), A5 = c(3, 1, 2,
3, 5, 4, 5), B1 = c(3, 2, 1, 5, 1, 3, 2), B2 = c(1, 2, 1, 4,
2, 4, 5), B3 = c(1, 2, 3, 2, 5, 4, 3), B4 = c(2, 3, 1, 4, 5,
2, 3), B5 = c(2, 2, 2, 2, 2, 2, 2), A1.order = c(1, 2, 1, 2,
1, 2, 1), A2.order = c(3, 4, 3, 4, 3, 4, 3), A3.order = c(5,
6, 5, 6, 5, 6, 5), A4.order = c(7, 8, 7, 8, 7, 8, 7), A5.order = c(10,
9, 10, 9, 10, 9, 10), B1.order = c(2, 1, 2, 1, 2, 1, 2), B2.order = c(4,
3, 4, 3, 4, 3, 4), B3.order = c(6, 5, 6, 5, 6, 5, 6), B4.order = c(8,
7, 8, 7, 8, 7, 8), B5.order = c(9, 10, 9, 10, 9, 10, 9)), row.names = c(NA,
7L), class = "data.frame")
df
#> ID A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 A1.order A2.order A3.order A4.order
#> 1 101 1 2 3 4 3 3 1 1 2 2 1 3 5 7
#> 2 102 4 3 1 3 1 2 2 2 3 2 2 4 6 8
#> 3 103 2 4 3 2 2 4 3 3 1 2 1 3 5 7
#> 4 104 3 1 5 1 3 5 4 2 4 2 2 4 6 8
#> 5 105 4 5 4 3 5 1 2 5 5 2 1 3 5 7
#> 6 106 5 3 3 2 4 3 4 4 2 2 2 4 6 8
#> 7 107 3 2 2 1 5 2 5 3 3 2 1 3 5 7
#> A5.order B1.order B2.order B3.order B4.order B5.order
#> 1 10 2 4 6 8 9
#> 2 9 1 3 5 7 10
#> 3 10 2 4 6 8 9
#> 4 9 1 3 5 7 10
#> 5 10 2 4 6 8 9
#> 6 9 1 3 5 7 10
#> 7 10 2 4 6 8 9
Here is the output I want:
> df2
ID A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 A1.order A2.order A3.order A4.order A5.order B1.order B2.order
1 101 3 NA NA NA 3 3 NA NA 2 2 1 3 5 7 10 2 4
2 102 4 3 1 3 1 2 2 2 3 2 2 4 6 8 9 1 3
3 103 1 1 1 2 2 1 1 3 1 2 1 3 5 7 10 2 4
4 104 3 1 5 1 3 5 4 2 4 2 2 4 6 8 9 1 3
5 105 4 5 4 3 5 1 2 5 5 2 1 3 5 7 10 2 4
6 106 5 3 3 2 4 3 4 4 2 2 2 4 6 8 9 1 3
7 107 3 2 2 1 5 2 5 3 3 2 1 3 5 7 10 2 4
B3.order B4.order B5.order identical
1 6 8 9 TRUE
2 5 7 10 FALSE
3 6 8 9 FALSE
4 5 7 10 FALSE
5 6 8 9 FALSE
6 5 7 10 FALSE
7 6 8 9 FALSE
Upvotes: 1
Views: 96
Reputation: 51
Based on how I understand your question - I think the following code addresses your question. The big issue here is the data structure - data.table has some nice restructuring methods that will help. like what @David is saying base R rle will help capture running streaks
library(data.table)
running_value = function(input, value){
run_len = rle(test$input)
inverse.rle(list(lengths = run_len$lengths,values = run_len$lengths >= value))
}
setDT(df)
order_dt = df[,c('ID',grep('order', names(df),value = T)), with = F]
order_dt = melt(order_dt, id.vars = 'ID', value.name = 'order')
order_dt[,variable := gsub('.order', '', variable)]
response_dt = df[,c(names(df)[!grepl('order',names(df))]), with = F]
response_dt = melt(response_dt, id.vars = 'ID', value.name = 'input')
order_response = merge(order_dt, response_dt, by = c('ID', 'variable'))
setorder(order_response, ID, order)
#detailed way to look at it
order_response[,identical := running_value(input, value = 5), by = c('ID')]
#matching the format you gave
order_response_sum = order_response[,.(identical = any(identical)), by = c('ID')]
merge(df, order_response_sum, by = 'ID')
Upvotes: 0
Reputation: 1074
The rle
function calculates streaks across a vector. You can (1) create a detect_streaks
function that returns TRUE/FALSE; and (2) loop through the rows, checking for consecutive streaks.
detect_streak <- function(x) {
streaks <- rle(x)$lengths
return(any(streaks >= 5))
}
df$identical <- vector(mode = "logical", length = nrow(df))
for (i in 1:nrow(df)) {
df$identical[i] <- detect_streak(df[i,2:11])
}
In the data you've provided, there are no streaks between A1:B5.
Upvotes: 1