Reputation: 65
I could not find a way to check whether categorical elements of a vector lie between other categorical elements. Given this data frame:
id letter
1 B
2 A
3 B
4 B
5 C
6 B
7 A
8 B
9 C
Everything I found deals with numerical values and a general ordering, rather than with the position of an element within a specific vector.
I want to add a new column with boolean values (1 if B is between A and C; 0 if B is between C and A) to the data frame:
id letter between
1 B 0
2 A NA
3 B 1
4 B 1
5 C NA
6 B 0
7 A NA
8 B 1
9 C NA
Upvotes: 2
Views: 149
Reputation: 46886
It's unclear from the question whether "A" and "C" must alternate, though that's implied because there is no coding for "B" between "A" and "A" (or between "C" and "C"). Supposing that they do, for the vector
x = c("B", "A", "B", "B", "C", "B", "A", "B", "C")
map to numeric values c(A=1, B=0, C=-1)
and form the cumulative sum
v = cumsum(c(A=1, B=0, C=-1)[x])
(increment by 1 when encountering "A", decrement by 1 when encountering "C"). Then replace positions not corresponding to "B" with NA
v[x != "B"] = NA
giving
> v
B A B B C B A B C
0 NA 1 1 NA 0 NA 1 NA
This could be captured as a function
fun = function(x, map = c(A = 1, B = 0, C = -1)) {
    x = map[x]
    v = cumsum(x)
    v[x != 0] = NA
    v
}
and used to transform a data.frame or tibble, e.g.,
library(dplyr)  # provides tibble(), mutate() and %>%
tibble(x) %>% mutate(v = fun(x))
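As a quick check of the function above against the question's data, here is a minimal base R sketch. The data frame name `df` and the `unname()` call (used only to drop the letter names from the result) are assumptions added for illustration; the function body is otherwise the same as fun() above, repeated so the snippet runs on its own.

```r
# Data from the question; `df` is an assumed name.
df <- data.frame(id = 1:9,
                 letter = c("B", "A", "B", "B", "C", "B", "A", "B", "C"),
                 stringsAsFactors = FALSE)

# Same logic as fun() above; unname() is added here only to drop the names.
fun <- function(x, map = c(A = 1, B = 0, C = -1)) {
  x <- map[x]        # translate letters to +1 / 0 / -1
  v <- cumsum(x)     # running count of "A seen but not yet closed by C"
  v[x != 0] <- NA    # keep values only at the "B" positions
  unname(v)
}

df$between <- fun(df$letter)
df$between
# 0 NA 1 1 NA 0 NA 1 NA
```

Note that this relies on "A" and "C" strictly alternating, as stated above; with two "A"s before a "C" the cumulative sum would reach 2.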
Upvotes: 1
Reputation: 624
Here's one solution, which I hope is fairly easy conceptually. For 'special' cases such as B being at the top or bottom of the list, or having an A or a C on both sides, I've set such values to 0.
# Create dummy data; substitute your own data frame
df <- data.frame(id = 1:100,
                 letter = sample(c("A", "B", "C"), 100, replace = TRUE))
# Copy down info on whether A or C is above each B
acup <- df$letter
for (i in 2:nrow(df)) {
  if (df$letter[i] == "B")
    acup[i] <- acup[i - 1]
}
# Copy up info on whether A or C is below each B
acdown <- df$letter
for (i in (nrow(df) - 1):1) {
  if (df$letter[i] == "B")
    acdown[i] <- acdown[i + 1]
}
# Set appropriate values for column 'between'
df$between <- NA
df$between[acup == "A" & acdown == "C"] <- 1
df$between[df$letter == "B" & is.na(df$between)] <- 0 # Includes special cases
Upvotes: 0
Reputation: 40171
A different tidyverse possibility could be:
library(dplyr)
library(tidyr)  # for fill()
df %>%
  group_by(grp = with(rle(letter), rep(seq_along(lengths), lengths))) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  mutate(res = ifelse(lag(letter, default = first(letter)) == "A" &
                        lead(letter, default = last(letter)) == "C", 1, 0)) %>%
  select(-letter, -grp) %>%
  full_join(df, by = c("id" = "id")) %>%
  arrange(id) %>%
  fill(res) %>%
  mutate(res = ifelse(letter != "B", NA, res))
id res letter
<int> <dbl> <chr>
1 1 0 B
2 2 NA A
3 3 1 B
4 4 1 B
5 5 NA C
6 6 0 B
7 7 NA A
8 8 1 B
9 9 NA C
This first groups the rows by a run-length ID and keeps the first row of each run. Second, it checks the condition on the neighbouring runs. Third, it performs a full join with the original df on the "id" column. Finally, it arranges by "id", fills the missing values downward, and assigns NA to rows where "letter" is not "B".
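To make the run-length ID used in the grouping step concrete, here is a small base R sketch on the question's letters. The variable names `runs` and `grp` are chosen for illustration only.

```r
# Letters from the question, as a character vector.
letter <- c("B", "A", "B", "B", "C", "B", "A", "B", "C")

# rle() finds runs of identical consecutive values.
runs <- rle(letter)

# One ID per run, repeated for every row belonging to that run.
grp <- rep(seq_along(runs$lengths), runs$lengths)
grp
# 1 2 3 3 4 5 6 7 8
```

Rows 3 and 4 share ID 3 because they form a single run of "B", which is why keeping the first row per group collapses each run to one row.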
Upvotes: 1
Reputation: 3183
You can use the lead and lag functions to find the letters before and after each row, and then mutate as below:
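library(dplyr)
df %>%
  mutate(letter_lag = lag(letter, 1),
         letter_lead = lead(letter, 1)) %>%
  # Only the immediate neighbours are inspected, so a B in the middle of a
  # run of three or more Bs gets NA.
  mutate(between = case_when(letter != "B" ~ NA_real_,
                             letter_lag == "A" | letter_lead == "C" ~ 1,
                             letter_lag == "C" | letter_lead == "A" ~ 0,
                             TRUE ~ NA_real_)) %>%
  select(id, letter, between)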
id letter between
1 1 B 0
2 2 A NA
3 3 B 1
4 4 B 1
5 5 C NA
6 6 B 0
7 7 A NA
8 8 B 1
9 9 C NA
Upvotes: -1
Reputation: 20409
A combination of rle (run length encoding) and zoo::rollapply is one option:
library(zoo)
d <- structure(list(id = 1:9,
                    letter = structure(c(2L, 1L, 2L, 2L, 3L, 2L, 1L, 2L, 3L),
                                       .Label = c("A", "B", "C"),
                                       class = "factor")),
               class = "data.frame", row.names = c(NA, -9L))
rl <- rle(as.numeric(d$letter))
rep(rollapply(c(NA, rl$values, NA),
              3,
              function(x) if (x[2] == 2)
                            ifelse(x[1] == 1 && x[3] == 3, 1, 0)
                          else NA),
    rl$lengths)
# [1] 0 NA 1 1 NA 0 NA 1 NA
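The same window-of-three idea can also be sketched in base R without zoo, by shifting the run values one position in each direction. This is an illustrative variant rather than the code above: the names `prev_run`, `next_run`, `flag` and `between` are assumptions, and a run of B at either end of the vector is assumed to get 0.

```r
# Letters from the question, as a character vector.
letter <- c("B", "A", "B", "B", "C", "B", "A", "B", "C")

rl <- rle(letter)
vals <- rl$values

# Neighbouring runs; NA marks "no run" at either end.
prev_run <- c(NA, head(vals, -1))
next_run <- c(tail(vals, -1), NA)

# %in% treats NA as "no match", so a B run at either end gets 0.
flag <- ifelse(vals == "B",
               as.numeric(prev_run %in% "A" & next_run %in% "C"),
               NA)

# Expand the per-run result back to one value per row.
between <- rep(flag, rl$lengths)
between
# 0 NA 1 1 NA 0 NA 1 NA
```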
Explanation
With rle you identify blocks of consecutive values.
With rollapply you "roll" a function with a given window size (here 3) over a vector.
rl$values contains the value of each block, and the function we apply to it is pretty straightforward:
if element 2 is not a B (coded as 2), return NA;
if element 1 is an A and element 3 is a C, return 1, and 0 otherwise.
Upvotes: 1