Reputation: 231
I have a dataset with travels that looks like this
from <- c("NYC","PAR", "MAD")
to <- c('PAR', 'SYD', "BCN")
date <- c("05/07","05/07", "06/08")
step <- c(1, 2, 1)
df <- data.frame(from, to, date, step)
The "step" column tells the step of the trip. Some are simple travels (like MAD-BCN), but others have a step (like NYC-PAR-SYD) (or sometimes two steps).
I would like to create a column that resumes the complete trip if there is a step. In our example we should create a column "trip" that contains "NYC-PAR-SYD" for the 2 first rows (NYC-PAR and PAR-SYD), and just "MAD-BCN" for the third row.
The problem for me is that there are hundreds of thousands of rows, and I would like to find an effective way to do it.
Upvotes: 1
Views: 107
Reputation: 21908
I first thought of a way for a problem with 2 steps but as I noticed there may be more steps involved in your case I decided to generalize my solution. I will post both of them here:
library(dplyr)
df %>%
mutate(id = cumsum(step == 1)) %>%
group_by(id) %>%
pivot_longer(c(from, to), names_to = "dest", values_to = "val") %>%
mutate(dup = val == lag(val, default = "Anoush")) %>%
filter(!dup) %>%
mutate(across(val, ~ paste(.x, collapse = "-"))) %>%
select(-dup) %>%
slice(n = 2:n())
# A tibble: 3 x 5
# Groups: id [2]
date step id dest val
<chr> <dbl> <int> <chr> <chr>
1 05/07 1 1 to NYC-PAR-SYD
2 05/07 2 1 to NYC-PAR-SYD
3 06/08 1 2 to MAD-BCN
# With @AnilGoyal's data set
# A tibble: 8 x 4
# Groups: id [4]
step id dest val
<int> <int> <chr> <chr>
1 1 1 to ABC-BCD-DEF-FGH-GHI
2 2 1 to ABC-BCD-DEF-FGH-GHI
3 3 1 to ABC-BCD-DEF-FGH-GHI
4 4 1 to ABC-BCD-DEF-FGH-GHI
5 1 2 to CDE-GHI
6 1 3 to IJK-JKL-LMN
7 2 3 to IJK-JKL-LMN
8 1 4 to LMN-OPQ
My first solution for only two steps which only works for your data set:
library(stringr)
df %>%
mutate(start = ifelse(from == lag(to), lag(from), NA),
finish = ifelse(to == lead(from), lead(to), NA)) %>%
rowwise() %>%
mutate(itn = ifelse(is.na(start), paste(from, to, finish, sep = "-"),
ifelse(is.na(finish), paste(start, from, to, sep = "-"),
paste(from, to, sep = "-")))) %>%
select(-c(start, finish)) %>%
mutate(across(itn, ~ str_replace(., "-NA", "")))
# A tibble: 3 x 5
# Rowwise:
from to date step itn
<chr> <chr> <chr> <dbl> <chr>
1 NYC PAR 05/07 1 NYC-PAR-SYD
2 PAR SYD 05/07 2 NYC-PAR-SYD
3 MAD BCN 06/08 1 MAD-BCN
Upvotes: 1
Reputation: 26218
A simple tidyverse
strategy on an elaborated example
df <- data.frame(
stringsAsFactors = FALSE,
from = c("ABC", "BCD", "DEF", "FGH", "CDE", "IJK", "JKL", "LMN"),
to = c("BCD", "DEF", "FGH", "GHI", "GHI", "JKL", "LMN", "OPQ"),
step = c(1L, 2L, 3L, 4L, 1L, 1L, 2L, 1L)
)
df
#> from to step
#> 1 ABC BCD 1
#> 2 BCD DEF 2
#> 3 DEF FGH 3
#> 4 FGH GHI 4
#> 5 CDE GHI 1
#> 6 IJK JKL 1
#> 7 JKL LMN 2
#> 8 LMN OPQ 1
library(tidyverse)
df %>%
mutate(id = cumsum(step == 1)) %>%
split(.$id) %>%
map_dfr(~ .x %>%
pivot_longer(c('from', 'to')) %>%
summarise(id = first(id),
itinerary = paste(value[!duplicated(value)], collapse = '->')))
#> # A tibble: 4 x 2
#> id itinerary
#> <int> <chr>
#> 1 1 ABC->BCD->DEF->FGH->GHI
#> 2 2 CDE->GHI
#> 3 3 IJK->JKL->LMN
#> 4 4 LMN->OPQ
Created on 2021-06-01 by the reprex package (v2.0.0)
Upvotes: 2