creativename

Reputation: 418

Loop to Replace Matching Values

I'm looking for a simple, elegant way to accomplish this.
Given dataset x, where the relationships form the chains A -> B -> Z -> Y and D -> H -> G, I would like to create dataset y, in which every value in to is replaced by the final element of its chain. Unfortunately, the rows are not necessarily in chain order:

> x <- data.frame(
+     from = as.character(c("A", "E", "B", "D", "H", "Z")), 
+     to = as.character(c("B", "E", "Z", "H", "G", "Y")))
> 
> y <- data.frame(
+     from = as.character(c("A", "E", "B", "D", "H", "Z")), 
+     to = as.character(c("Y", "E", "Y", "G", "G", "Y")))
> 
> x
  from to
1    A  B
2    E  E
3    B  Z
4    D  H
5    H  G
6    Z  Y
> y
  from to
1    A  Y
2    E  E
3    B  Y
4    D  G
5    H  G
6    Z  Y


I have a fairly large dataset (currently 500k rows, and it will grow), so performance matters. I'm not sure whether this can be done without a for-loop, or whether the process can be vectorized or parallelized.
I'm thinking about splitting off all rows where from == to, or creating an indicator to skip certain rows, so that the loop does not have to go through the entire dataset each time.
I'd also like to know what the stopping condition should be if I do write a loop; I'm not sure how to define when the loop should stop.
Any suggestions would be appreciated. Thanks!
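
To make the stopping condition concrete, here is a minimal sketch of one possible vectorized loop, not a definitive implementation. It repeatedly looks up each current to value in from with match() and stops once a pass changes nothing. The function name resolve_chains and the max_iter argument are hypothetical, introduced only for illustration. The sketch assumes each from value appears at most once; max_iter is a guard in case the data contains a cycle, where the values would never settle.

```r
# Hypothetical helper: follow each 'to' value along its chain until it stops changing.
# Assumes each value in 'from' is unique.
resolve_chains <- function(from, to, max_iter = 100L) {
  from <- as.character(from)
  to <- as.character(to)
  for (i in seq_len(max_iter)) {
    # For every current 'to', find the row where it appears as 'from'
    # and take that row's (already updated) 'to'.
    nxt <- to[match(to, from)]
    # Values with no onward link are chain ends; keep them unchanged.
    no_link <- is.na(nxt)
    nxt[no_link] <- to[no_link]
    # Stopping condition: a full pass that changes nothing.
    if (identical(nxt, to)) break
    to <- nxt
  }
  to
}

x <- data.frame(
  from = c("A", "E", "B", "D", "H", "Z"),
  to = c("B", "E", "Z", "H", "G", "Y"),
  stringsAsFactors = FALSE)

x$to <- resolve_chains(x$from, x$to)
x$to
# [1] "Y" "E" "Y" "G" "G" "Y"
```

Because every pass uses the values updated by the previous pass, the distance each row has travelled along its chain roughly doubles per pass, so the number of passes should grow with the logarithm of the longest chain rather than with the number of rows. Rows where from == to need no special handling here, since they simply map to themselves.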

Upvotes: 1

Views: 37

Answers (2)

MKR

Reputation: 20085

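Another solution uses lag from dplyr and fill from tidyr. Note that it relies on arrange(from) placing the links of each chain on adjacent rows, which holds for the data used below but not for every input (for example, not for the A -> B -> Z -> Y chain in the question):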

library(tidyverse)

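x %>% arrange(from) %>%
  # 1 if this row continues the chain of the previous row (NA on the first row)
  mutate(samegroup = ifelse(from == lag(to), 1, 0)) %>%
  # give each row that starts a chain its own group id
  mutate(group = ifelse(samegroup == 0 | is.na(samegroup), row_number(), NA)) %>%
  # carry the group id down to the remaining rows of the chain
  fill(group) %>%
  group_by(group) %>%
  # the last 'to' in each group is the end of the chain
  mutate(to = last(to)) %>%
  ungroup() %>%
  select(-samegroup, -group)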

# A tibble: 6 x 2
#  from  to   
#  <chr> <chr>
#1 A     D    
#2 B     D    
#3 C     D    
#4 E     E    
#5 F     H 
#6 G     H 

Data used

x <- data.frame(from = as.character(c("A", "B", "F", "C", "G", "E")), 
   to = as.character(c("B", "C", "G", "D", "H", "E")), 
   stringsAsFactors = FALSE)

Upvotes: 1

akrun

Reputation: 886938

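We can use dplyr to create a grouping variable by comparing adjacent elements of 'to' and 'from', and then change the values in 'to' to the last element of 'to' within each group. This assumes the rows are already ordered so that each link directly follows the previous one.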

library(dplyr)
x %>% 
    group_by(grp = cumsum(lag(lead(from, default = last(from)) != 
      as.character(to), default = TRUE))) %>% 
    mutate(to = last(to)) %>%
    ungroup %>%
    select(-grp)
# A tibble: 4 x 2
#  from   to    
# <fctr> <fctr>
#1 A      D     
#2 B      D     
#3 C      D     
#4 E      E    

Upvotes: 1
