Reputation: 418
I'm looking for an easy and elegant way to accomplish this.
So if I have dataset x and the relationships are A -> B -> Z -> Y and D -> H -> G, I would like to create dataset y. Unfortunately, the rows are not necessarily in order:
> x <- data.frame(
+ from = as.character(c("A", "E", "B", "D", "H", "Z")),
+ to = as.character(c("B", "E", "Z", "H", "G", "Y")))
>
> y <- data.frame(
+ from = as.character(c("A", "E", "B", "D", "H", "Z")),
+ to = as.character(c("Y", "E", "Y", "G", "G", "Y")))
>
> x
from to
1 A B
2 E E
3 B Z
4 D H
5 H G
6 Z Y
> y
from to
1 A Y
2 E E
3 B Y
4 D G
5 H G
6 Z Y
I have a fairly large dataset (currently 500k rows; will grow in the future) and actually care about the performance; I'm not sure if there are any other ways to do this without a for-loop or even to vectorize/parallelize the process.
I'm thinking about splitting off and removing all rows where from == to, or creating an indicator to skip certain rows so the loop does not have to go through the entire dataset each time.
I'd also like to know what the stopping condition should be if I do write a loop; I'm not sure how to define when the loop should stop.
Any suggestions would be appreciated. Thanks!
Upvotes: 1
Views: 37
Reputation: 20085
Another solution can be achieved using lag from dplyr and fill from tidyr:
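library(tidyverse)
# Example data used for the output below
x <- data.frame(from = c("A", "B", "F", "C", "G", "E"),
                to = c("B", "C", "G", "D", "H", "E"),
                stringsAsFactors = FALSE)
# Note: this relies on sorting by `from` putting the links of each chain on adjacent rows
x %>% arrange(from) %>%
  # 1 if this row continues the chain from the previous row, 0 otherwise (NA on the first row)
  mutate(samegroup = ifelse(from == lag(to), 1, 0)) %>%
  # give the first row of each chain an id, then carry that id down the chain
  mutate(group = ifelse(samegroup == 0 | is.na(samegroup), row_number(), NA)) %>%
  fill(group) %>%
  group_by(group) %>%
  # every row in a chain points to the chain's final `to`
  mutate(to = last(to)) %>%
  ungroup() %>%
  select(-samegroup, -group)
# A tibble: 6 x 2
#  from  to
#  <chr> <chr>
#1 A     D
#2 B     D
#3 C     D
#4 E     E
#5 F     H
#6 G     H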
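If the links of a chain do not end up on adjacent rows (as with the A -> B -> Z -> Y chain in the question), a lookup-based loop may be an alternative. The following is only a minimal base R sketch, not a tested solution for 500k rows: resolve_chains and max_iter are hypothetical names, and it assumes every value appears at most once in from and that there are no cycles other than self-loops such as E -> E. The loop stops as soon as a pass changes nothing, which is one possible answer to the stopping-condition question.

```r
# Question's data; stringsAsFactors = FALSE keeps both columns as character.
x <- data.frame(
  from = c("A", "E", "B", "D", "H", "Z"),
  to   = c("B", "E", "Z", "H", "G", "Y"),
  stringsAsFactors = FALSE
)

# resolve_chains() is a hypothetical helper, not part of any package.
# max_iter is a safety cap in case the data contains a cycle.
resolve_chains <- function(df, max_iter = 64L) {
  to <- df$to
  for (i in seq_len(max_iter)) {
    # Row whose `from` equals the current `to`; NA means `to` is a chain end.
    idx <- match(to, df$from)
    # Jump to wherever that row currently points, so each pass can
    # skip over more than one link of the chain.
    nxt <- ifelse(is.na(idx), to, to[idx])
    # Stopping condition: a full pass changed nothing.
    if (identical(nxt, to)) break
    to <- nxt
  }
  df$to <- to
  df
}

y <- resolve_chains(x)
```

Each pass is vectorized over all rows, and because a row jumps to the current target of the row it points at, the number of passes should grow roughly with the logarithm of the longest chain rather than with its length.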
Upvotes: 1
Reputation: 886938
We can use dplyr to create a grouping variable by comparing adjacent elements of 'to' and 'from', and then change the values in 'to' to the last element of 'to' within each group:
library(dplyr)
x %>%
  group_by(grp = cumsum(lag(lead(from, default = last(from)) !=
                              as.character(to), default = TRUE))) %>%
  mutate(to = last(to)) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 4 x 2
# from to
# <fctr> <fctr>
#1 A D
#2 B D
#3 C D
#4 E E
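To make the grouping step easier to follow, here is a rough base R sketch of the same idea. The input x2 is an assumption (the data behind the output above is not shown), chosen to be already in chain order; starts, ends and grp are illustrative names.

```r
# Assumed example input, already in chain order (not taken from the answer).
x2 <- data.frame(
  from = c("A", "B", "C", "E"),
  to   = c("B", "C", "D", "E"),
  stringsAsFactors = FALSE
)
n <- nrow(x2)

# A new group starts at row 1 and wherever `from` differs from the
# previous row's `to`.
starts <- c(TRUE, x2$from[-1] != x2$to[-n])
grp <- cumsum(starts)

# A row ends its group if the next row starts a new one (or it is the last row).
ends <- c(starts[-1], TRUE)

# Replace every `to` with the `to` of the last row in its group.
x2$to <- x2$to[ends][grp]
```

As with the dplyr version, this only compares neighbouring rows, so it depends on the rows of each chain being adjacent and in order.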
Upvotes: 1