Reputation: 531
I have a data frame with two variables (start, end). I would like to create an identifier variable that grows in ascending order of start and, most importantly, stays constant when a row's start coincides with the end of any other row in the data frame.
Below is a simple example of the data:
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
The output I would be looking for is the following:
output_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17),
NEW_VAR = c(1,1,2,3,4))
Upvotes: 0
Views: 241
Reputation: 1456
The following function should give you the desired identifier variable NEW_VAR.
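identifier <- \(df) {
  # Row i gets the number of rows up to i whose start is not found in end,
  # so a row whose start matches some end reuses the previous row's id.
  # This relies on linked rows being adjacent in df.
  x <- array(0L, dim = nrow(df))
  count <- 0L
  my_seq <- seq_len(nrow(df))
  for (i in my_seq) {
    if (!df[i, ]$start %in% df$end) {
      x[i] <- my_seq[i] + count
    } else {
      x[i] <- my_seq[i] - 1L + count
      count <- count - 1L
    }
  }
  x
}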
Examples
# your example (rows arranged as in the desired output)
toy_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 1 1 2 3 4
# other example
toy_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
end = c(16, 2, 21, 2, 2, 2, 3, 1))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 0 0 0 1 1 1 2 2
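If the rows are not already arranged so that linked rows sit next to each other, one option is to follow the links explicitly. Below is a minimal base R sketch, not part of the function above: chain_id is a hypothetical helper name, and it assumes that each start matches at most one end (match() only uses the first hit) and that the links form no cycles.

```r
# Hypothetical helper: assign one id per chain of rows, where a row is linked
# to the row whose `end` equals its `start`. Does not depend on row order.
chain_id <- function(df) {
  n <- nrow(df)
  # parent[i] = index of the first row whose end equals start[i], NA if none
  parent <- match(df$start, df$end)
  # a row is never treated as its own parent
  parent[!is.na(parent) & parent == seq_len(n)] <- NA
  # walk up the parent links to find the first row of each chain
  root <- seq_len(n)
  for (i in seq_len(n)) {
    j <- i
    steps <- 0L
    while (!is.na(parent[j]) && steps < n) {  # steps guard stops on a cycle
      j <- parent[j]
      steps <- steps + 1L
    }
    root[i] <- j
  }
  # number the chains in ascending order of the start of their first row
  match(root, unique(root[order(df$start[root])]))
}

toy_data <- data.frame(start = c(1, 5, 6, 10, 16),
                       end   = c(10, 9, 11, 15, 17))
toy_data$NEW_VAR <- chain_id(toy_data)
toy_data$NEW_VAR
# [1] 1 2 3 1 4

# arrange the rows as in the expected output
sorted <- toy_data[order(toy_data$NEW_VAR, toy_data$start), ]
print(sorted)
```

Sorting by NEW_VAR and then start puts the rows in the order start = 1, 10, 5, 6, 16, which is the arrangement shown in the question's output_data.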
Upvotes: 1
Reputation: 30484
You could try adapting this answer to group by ranges that are adjacent to each other. Credit goes entirely to @r2evans.
In this case, you would use expand.grid to get combinations of start and end. Instead of labels you would have row numbers rn to reference.
In the end, you can number the groups based on which rows appear together in the list. The last few lines starting with enframe use tibble/tidyverse. To match the group numbers in your expected output, I also re-sorted the results.
I hope this might be helpful.
library(tidyverse)
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
toy_data$rn = 1:nrow(toy_data)
eg <- expand.grid(a = seq_len(nrow(toy_data)), b = seq_len(nrow(toy_data)))
eg <- eg[eg$a < eg$b,]
together <- cbind(
  setNames(toy_data[eg$a,], paste0(names(toy_data), "1")),
  setNames(toy_data[eg$b,], paste0(names(toy_data), "2"))
)
together <- subset(together, end1 == start2)
groups <- split(together$rn2, together$rn1)
for (i in toy_data$rn) {
  ind <- (i == names(groups)) | sapply(groups, `%in%`, x = i)
  vals <- groups[ind]
  groups <- c(
    setNames(list(unique(c(i, names(vals), unlist(vals)))), i),
    groups[!ind]
  )
}
# smallest row number in each group (compared as numbers, not as strings)
min_row <- sapply(groups, \(g) min(as.numeric(g)))
# order the groups by their smallest row number
lapply(order(min_row), \(x) toy_data[toy_data$rn %in% groups[[x]], ]) %>%
  enframe() %>%
  unnest(cols = value) %>%
  select(-rn)
Output
   name start   end
  <int> <dbl> <dbl>
1     1     1    10
2     1    10    15
3     2     5     9
4     3     6    11
5     4    16    17
Upvotes: 1