jjulip

Reputation: 1133

Determine rows in a dataframe where the next row has an identical character value in R

I have some data with a categorical variable (either apple or banana), and I want to identify places in my dataset where the same value occurs in two consecutive rows (i.e. rows 4 and 5 below for bananas, and rows 8 and 9 below for apples). I suspect the duplicated function will be useful here (e.g. "Index out the subsequent row with an identical value in R"), but I am not sure how to achieve my desired output with categorical variables.

Example data:

  test =  structure(list(cnt = c(87L, 51L, 24L, 69L, 210L, 21L, 15L, 9L, 
    12L), type = c("apple", "banana", "apple", "banana", "banana", 
    "apple", "banana", "apple", "apple")), .Names = c("cnt", "type"
    ), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
    -9L), spec = structure(list(cols = structure(list(cnt = structure(list(), class = c("collector_integer", 
    "collector")), type = structure(list(), class = c("collector_character", 
    "collector"))), .Names = c("cnt", "type")), default = structure(list(), class = c("collector_guess", 
    "collector"))), .Names = c("cols", "default"), class = "col_spec"))

Desired output:

    cnt   type  output
1    87  apple FALSE
2    51 banana FALSE
3    24  apple FALSE
4    69 banana TRUE
5   210 banana TRUE
6    21  apple FALSE
7    15 banana FALSE
8     9  apple TRUE
9    12  apple TRUE

When I use the following code, I just get a summary telling me that both apples and bananas are duplicated:

test[!duplicated(test[, "type"], fromLast = TRUE), ]

Any help would be much appreciated.

Upvotes: 1

Views: 338

Answers (2)

mt1022

Reputation: 17289

We can try run length encoding:

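x <- rle(test$type)             # run lengths and values of consecutive identical types
x$values <- x$lengths == 2      # TRUE for runs of exactly two rows (use >= 2 to include longer runs)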

test$output <- inverse.rle(x)
# > test
#   cnt   type output
# 1  87  apple  FALSE
# 2  51 banana  FALSE
# 3  24  apple  FALSE
# 4  69 banana   TRUE
# 5 210 banana   TRUE
# 6  21  apple  FALSE
# 7  15 banana  FALSE
# 8   9  apple   TRUE
# 9  12  apple   TRUE
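The rle approach above flags runs of exactly two rows. If longer runs should be flagged too, the same idea can be wrapped in a small helper. This is only a sketch in base R: `flag_runs` and its `min_run` argument are hypothetical names, not part of the original answer.

```r
# Hypothetical helper (not from the answer): flag every element that belongs
# to a run of at least `min_run` identical consecutive values.
flag_runs <- function(x, min_run = 2L) {
  runs <- rle(as.character(x))             # lengths and values of each run
  runs$values <- runs$lengths >= min_run   # one logical per run
  inverse.rle(runs)                        # expand back to one logical per element
}

type <- c("apple", "banana", "apple", "banana", "banana",
          "apple", "banana", "apple", "apple")
flag_runs(type)
# [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
```

With the question's data this would be used as `test$output <- flag_runs(test$type)`.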

Upvotes: 2

akrun

Reputation: 887108

We can do this in multiple ways. One option is rleid from data.table, which creates a grouping variable from runs of adjacent identical elements. We then create the 'output' column by assigning (:=) the result of a logical condition, i.e. whether the number of elements in the group is greater than 1 (.N > 1).

library(data.table)
setDT(test)[, output := .N > 1, by = rleid(type)]
test
#   cnt   type output
#1:  87  apple  FALSE
#2:  51 banana  FALSE
#3:  24  apple  FALSE
#4:  69 banana   TRUE
#5: 210 banana   TRUE
#6:  21  apple  FALSE
#7:  15 banana  FALSE
#8:   9  apple   TRUE
#9:  12  apple   TRUE

Based on the OP's description, one option with tidyverse would be

library(tidyverse)
test %>% 
    mutate(output = (type == lead(type, default = type[n()-1]))|
                     type == lag(type, default = type[2]))
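For readers without the tidyverse, the same neighbour comparison can be sketched in base R. The variable names below (`same_as_next`, `same_as_prev`) are illustrative only, and the sketch assumes the vector has at least two elements and no missing values.

```r
type <- c("apple", "banana", "apple", "banana", "banana",
          "apple", "banana", "apple", "apple")
n <- length(type)

# Compare each element with its successor: element i against element i + 1.
eq <- type[-1] == type[-n]

same_as_next <- c(eq, FALSE)   # the last row has no next row
same_as_prev <- c(FALSE, eq)   # the first row has no previous row

# A row is flagged when it matches either neighbour.
output <- same_as_next | same_as_prev
output
# [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
```

Unlike the lead/lag version, this pads the ends with FALSE rather than choosing default values, which gives the same result on this data.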

Upvotes: 2
