Reputation: 1133
I have some data with a categorical variable (either apple or banana), and I want to identify places in my dataset where the same value appears in two consecutive rows (i.e. rows 4 and 5 below for bananas, and rows 8 and 9 below for apples). I know that the duplicated function will be useful here (see "Index out the subsequent row with an identical value in R"), but I am not sure how to achieve my desired output with categorical variables.
Example data:
test = structure(list(cnt = c(87L, 51L, 24L, 69L, 210L, 21L, 15L, 9L,
12L), type = c("apple", "banana", "apple", "banana", "banana",
"apple", "banana", "apple", "apple")), .Names = c("cnt", "type"
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-9L), spec = structure(list(cols = structure(list(cnt = structure(list(), class = c("collector_integer",
"collector")), type = structure(list(), class = c("collector_character",
"collector"))), .Names = c("cnt", "type")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
Desired output:
cnt type output
1 87 apple FALSE
2 51 banana FALSE
3 24 apple FALSE
4 69 banana TRUE
5 210 banana TRUE
6 21 apple FALSE
7 15 banana FALSE
8 9 apple TRUE
9 12 apple TRUE
When I use the following code, I just get a summary telling me that both apples and bananas are duplicated:
test[!duplicated(test[, "type"], fromLast = TRUE), ]
Any help would be much appreciated.
Upvotes: 1
Views: 338
Reputation: 17289
We can try run length encoding:
x <- rle(test$type)
x$values <- x$lengths == 2  # TRUE for runs of exactly two rows; use >= 2 to also flag longer runs
test$output <- inverse.rle(x)
# > test
# cnt type output
# 1 87 apple FALSE
# 2 51 banana FALSE
# 3 24 apple FALSE
# 4 69 banana TRUE
# 5 210 banana TRUE
# 6 21 apple FALSE
# 7 15 banana FALSE
# 8 9 apple TRUE
# 9 12 apple TRUE
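To see why the run-length approach works, it can help to look at what rle returns before the values are overwritten. The sketch below rebuilds the type column from the question as a plain character vector (the names `runs` and `output` are only illustrative), and it uses `>= 2` rather than `== 2` on the assumption that runs longer than two should be flagged as well.

```r
# Same 'type' values as in the question, as a plain character vector
type <- c("apple", "banana", "apple", "banana", "banana",
          "apple", "banana", "apple", "apple")

# rle() collapses the vector into runs of identical adjacent values
runs <- rle(type)
runs$lengths  # 1 1 1 2 1 1 2
runs$values   # "apple" "banana" "apple" "banana" "apple" "banana" "apple"

# Replace each run's value with a flag, then expand back to one entry per row.
# Assumption: runs longer than two should also be TRUE, hence >= 2.
runs$values <- runs$lengths >= 2
output <- inverse.rle(runs)
output  # FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
```

Because inverse.rle repeats each value by its run length, the result lines up row for row with the original data and can be assigned straight to a new column.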
Upvotes: 2
Reputation: 887108
We can do this in multiple ways. One option is rleid from data.table to create a grouping variable based on adjacent elements that are the same, and then create the 'output' column by assigning (:=) the result of the logical condition, i.e. whether the number of elements in the group is greater than 1 (.N > 1).
library(data.table)
setDT(test)[, output := .N>1, rleid(type)]
test
# cnt type output
#1: 87 apple FALSE
#2: 51 banana FALSE
#3: 24 apple FALSE
#4: 69 banana TRUE
#5: 210 banana TRUE
#6: 21 apple FALSE
#7: 15 banana FALSE
#8: 9 apple TRUE
#9: 12 apple TRUE
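If data.table is not available, the same grouping idea can be sketched in base R. This is only an approximation of what rleid(type) produces for a single vector, not its actual implementation, and the names `starts`, `run_id` and `run_size` are made up for the example.

```r
# Same 'type' values as in the question, as a plain character vector
type <- c("apple", "banana", "apple", "banana", "banana",
          "apple", "banana", "apple", "apple")

# A new run starts at row 1 and wherever the value differs from the row above
starts <- c(TRUE, type[-1] != type[-length(type)])

# Cumulative count of run starts gives one id per run of adjacent equal values
run_id <- cumsum(starts)  # 1 2 3 4 4 5 6 7 7

# Size of the run each row belongs to, then the same "more than one row" test
run_size <- ave(seq_along(type), run_id, FUN = length)
output <- run_size > 1
output  # FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
```

Rows 4 and 5 share run id 4 and rows 8 and 9 share run id 7, which is why only those rows end up TRUE.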
Based on the OP's description, one option with tidyverse would be
library(tidyverse)
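test %>%
  # flag a row if it matches the next row or the previous row;
  # the defaults make the last row compare with the row before it
  # and the first row compare with the row after it
  mutate(output = type == lead(type, default = type[n() - 1]) |
                  type == lag(type, default = type[2]))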
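The neighbour comparison can also be written without any packages, which may make the logic easier to follow. This is a minimal base R sketch of the same idea rather than a translation of lead and lag themselves; `same_as_next` and `same_as_prev` are illustrative names, and the vector is assumed to have at least one element.

```r
# Same 'type' values as in the question, as a plain character vector
type <- c("apple", "banana", "apple", "banana", "banana",
          "apple", "banana", "apple", "apple")
n <- length(type)

# Compare each row with the one after it; the last row has no next row
same_as_next <- c(type[-n] == type[-1], FALSE)

# Compare each row with the one before it; the first row has no previous row
same_as_prev <- c(FALSE, type[-1] == type[-n])

# A row is flagged if it matches either neighbour
output <- same_as_next | same_as_prev
output  # FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
```

Padding with FALSE at the ends plays the same role as the default arguments above: it keeps the first and last rows from producing NA.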
Upvotes: 2