Awans
Awans

Reputation: 231

How to create overall itinerary

I have a dataset with travels that looks like this

from <- c("NYC","PAR", "MAD")
to <- c('PAR', 'SYD', "BCN")
date <- c("05/07","05/07", "06/08")
step <- c(1, 2, 1)
df <- data.frame(from, to, date, step)

The "step" column tells the step of the trip. Some are simple travels (like MAD-BCN), but others have a step (like NYC-PAR-SYD) (or sometimes two steps).

I would like to create a column that resumes the complete trip if there is a step. In our example we should create a column "trip" that contains "NYC-PAR-SYD" for the 2 first rows (NYC-PAR and PAR-SYD), and just "MAD-BCN" for the third row.

The problem for me is that there are hundreds of thousands of rows, and I would like to find an effective way to do it.

Upvotes: 1

Views: 107

Answers (2)

Anoushiravan R
Anoushiravan R

Reputation: 21908

I first thought of a way for a problem with 2 steps but as I noticed there may be more steps involved in your case I decided to generalize my solution. I will post both of them here:

library(dplyr)

df %>%
  mutate(id = cumsum(step == 1)) %>%
  group_by(id) %>%
  pivot_longer(c(from, to), names_to = "dest", values_to = "val") %>%
  mutate(dup = val == lag(val, default = "Anoush")) %>%
  filter(!dup) %>%
  mutate(across(val, ~ paste(.x, collapse = "-"))) %>%
  select(-dup) %>%
  slice(n = 2:n())

# A tibble: 3 x 5
# Groups:   id [2]
  date   step    id dest  val        
  <chr> <dbl> <int> <chr> <chr>      
1 05/07     1     1 to    NYC-PAR-SYD
2 05/07     2     1 to    NYC-PAR-SYD
3 06/08     1     2 to    MAD-BCN 

# With @AnilGoyal's data set

# A tibble: 8 x 4
# Groups:   id [4]
   step    id dest  val                
  <int> <int> <chr> <chr>              
1     1     1 to    ABC-BCD-DEF-FGH-GHI
2     2     1 to    ABC-BCD-DEF-FGH-GHI
3     3     1 to    ABC-BCD-DEF-FGH-GHI
4     4     1 to    ABC-BCD-DEF-FGH-GHI
5     1     2 to    CDE-GHI            
6     1     3 to    IJK-JKL-LMN        
7     2     3 to    IJK-JKL-LMN        
8     1     4 to    LMN-OPQ  

My first solution for only two steps which only works for your data set:

library(stringr)

df %>%
  mutate(start = ifelse(from == lag(to), lag(from), NA),
         finish = ifelse(to == lead(from), lead(to), NA)) %>%
  rowwise() %>%
  mutate(itn = ifelse(is.na(start), paste(from, to, finish, sep = "-"), 
                      ifelse(is.na(finish), paste(start, from, to, sep = "-"), 
                             paste(from, to, sep = "-")))) %>%
  select(-c(start, finish)) %>%
  mutate(across(itn, ~ str_replace(., "-NA", "")))


# A tibble: 3 x 5
# Rowwise: 
  from  to    date   step itn        
  <chr> <chr> <chr> <dbl> <chr>      
1 NYC   PAR   05/07     1 NYC-PAR-SYD
2 PAR   SYD   05/07     2 NYC-PAR-SYD
3 MAD   BCN   06/08     1 MAD-BCN  

Upvotes: 1

AnilGoyal
AnilGoyal

Reputation: 26218

A simple tidyverse strategy on an elaborated example

df <- data.frame(
  stringsAsFactors = FALSE,
              from = c("ABC", "BCD", "DEF", "FGH", "CDE", "IJK", "JKL", "LMN"),
                to = c("BCD", "DEF", "FGH", "GHI", "GHI", "JKL", "LMN", "OPQ"),
              step = c(1L, 2L, 3L, 4L, 1L, 1L, 2L, 1L)
)

df
#>   from  to step
#> 1  ABC BCD    1
#> 2  BCD DEF    2
#> 3  DEF FGH    3
#> 4  FGH GHI    4
#> 5  CDE GHI    1
#> 6  IJK JKL    1
#> 7  JKL LMN    2
#> 8  LMN OPQ    1

library(tidyverse)

df %>%
  mutate(id = cumsum(step == 1)) %>%
  split(.$id) %>%
  map_dfr(~ .x %>%
            pivot_longer(c('from', 'to')) %>%
            summarise(id = first(id),
                      itinerary = paste(value[!duplicated(value)], collapse = '->')))

#> # A tibble: 4 x 2
#>      id itinerary              
#>   <int> <chr>                  
#> 1     1 ABC->BCD->DEF->FGH->GHI
#> 2     2 CDE->GHI               
#> 3     3 IJK->JKL->LMN          
#> 4     4 LMN->OPQ

Created on 2021-06-01 by the reprex package (v2.0.0)

Upvotes: 2

Related Questions