in2td
in2td

Reputation: 31

Detecting identical observations across columns in R dataframe

Within a dataframe in R, I'm trying to detect whether 5 consecutive observations are identical in a pseudorandom design.

In this design, two classes of objects are presented 5 times (A objects and B objects) and participants make a rating of them (A1 - B5). The ordering of each object is contained in columns as well (A1.order - B5.order). What I'd like to know is whether 5 consecutive observations (can be the first 5, middle 5, last 5, etc.) are identical. The typical functions for detecting whether columns are identical don't work here because I am concerned about the presentation order of the objects.

I'd like for a new column called "identical" to be added to the dataframe which will give me a TRUE or FALSE as to whether there are 5 identical consecutive observations.

I'd also like for those identical observations to be coded as "NA"

df <- structure(list(ID = c(101, 102, 103, 104, 105, 106, 107), A1 = c(1, 
4, 1, 3, 4, 5, 3), A2 = c(1, 3, 1, 1, 5, 3, 2), A3 = c(1, 1, 
1, 5, 4, 3, 2), A4 = c(1, 3, 2, 1, 3, 2, 1), A5 = c(3, 1, 2, 
3, 5, 4, 5), B1 = c(3, 2, 1, 5, 1, 3, 2), B2 = c(1, 2, 1, 4, 
2, 4, 5), B3 = c(1, 2, 3, 2, 5, 4, 3), B4 = c(2, 3, 1, 4, 5, 
2, 3), B5 = c(2, 2, 2, 2, 2, 2, 2), A1.order = c(1, 2, 1, 2, 
1, 2, 1), A2.order = c(3, 4, 3, 4, 3, 4, 3), A3.order = c(5, 
6, 5, 6, 5, 6, 5), A4.order = c(7, 8, 7, 8, 7, 8, 7), A5.order = c(10, 
9, 10, 9, 10, 9, 10), B1.order = c(2, 1, 2, 1, 2, 1, 2), B2.order = c(4, 
3, 4, 3, 4, 3, 4), B3.order = c(6, 5, 6, 5, 6, 5, 6), B4.order = c(8, 
7, 8, 7, 8, 7, 8), B5.order = c(9, 10, 9, 10, 9, 10, 9)), row.names = c(NA, 
7L), class = "data.frame")

df

#>    ID A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 A1.order A2.order A3.order A4.order
#> 1 101  1  2  3  4  3  3  1  1  2  2        1        3        5        7
#> 2 102  4  3  1  3  1  2  2  2  3  2        2        4        6        8
#> 3 103  2  4  3  2  2  4  3  3  1  2        1        3        5        7
#> 4 104  3  1  5  1  3  5  4  2  4  2        2        4        6        8
#> 5 105  4  5  4  3  5  1  2  5  5  2        1        3        5        7
#> 6 106  5  3  3  2  4  3  4  4  2  2        2        4        6        8
#> 7 107  3  2  2  1  5  2  5  3  3  2        1        3        5        7
#>   A5.order B1.order B2.order B3.order B4.order B5.order
#> 1       10        2        4        6        8        9
#> 2        9        1        3        5        7       10
#> 3       10        2        4        6        8        9
#> 4        9        1        3        5        7       10
#> 5       10        2        4        6        8        9
#> 6        9        1        3        5        7       10
#> 7       10        2        4        6        8        9

Here is the output I want:

> df2
   ID A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 A1.order A2.order A3.order A4.order A5.order B1.order B2.order
1 101  3 NA NA NA  3  3 NA NA  2  2        1        3        5        7       10        2        4
2 102  4  3  1  3  1  2  2  2  3  2        2        4        6        8        9        1        3
3 103  1  1  1  2  2  1  1  3  1  2        1        3        5        7       10        2        4
4 104  3  1  5  1  3  5  4  2  4  2        2        4        6        8        9        1        3
5 105  4  5  4  3  5  1  2  5  5  2        1        3        5        7       10        2        4
6 106  5  3  3  2  4  3  4  4  2  2        2        4        6        8        9        1        3
7 107  3  2  2  1  5  2  5  3  3  2        1        3        5        7       10        2        4
  B3.order B4.order B5.order identical
1        6        8        9      TRUE
2        5        7       10     FALSE
3        6        8        9     FALSE
4        5        7       10     FALSE
5        6        8        9     FALSE
6        5        7       10     FALSE
7        6        8        9     FALSE

Upvotes: 1

Views: 96

Answers (2)

Bryan
Bryan

Reputation: 51

Based on how I understand your question - I think the following code addresses your question. The big issue here is the data structure - data.table has some nice restructuring methods that will help. like what @David is saying base R rle will help capture running streaks

library(data.table)

running_value = function(input, value){
  run_len = rle(test$input)
  inverse.rle(list(lengths = run_len$lengths,values = run_len$lengths >= value))
}

setDT(df)

order_dt = df[,c('ID',grep('order', names(df),value = T)), with = F]
order_dt = melt(order_dt, id.vars = 'ID', value.name = 'order')
order_dt[,variable := gsub('.order', '', variable)]

response_dt = df[,c(names(df)[!grepl('order',names(df))]), with = F]
response_dt = melt(response_dt, id.vars = 'ID', value.name = 'input')

order_response = merge(order_dt, response_dt, by = c('ID', 'variable'))
setorder(order_response, ID, order)

#detailed way to look at it
order_response[,identical := running_value(input, value = 5), by = c('ID')]

#matching the format you gave
order_response_sum = order_response[,.(identical = any(identical)), by = c('ID')]
merge(df, order_response_sum, by = 'ID')

Upvotes: 0

David Ranzolin
David Ranzolin

Reputation: 1074

The rle function calculates streaks across a vector. You can (1) create a detect_streaks function that returns TRUE/FALSE; and (2) loop through the rows, checking for consecutive streaks.

detect_streak <- function(x) {
  streaks <- rle(x)$lengths
  return(any(streaks >= 5))
}

df$identical <- vector(mode = "logical", length = nrow(df))
for (i in 1:nrow(df)) {
  df$identical[i] <- detect_streak(df[i,2:11])
}

In the data you've provided, there are no streaks between A1:B5.

Upvotes: 1

Related Questions