Reputation: 11
So I have this data of customers buying airline tickets online. I want to see how many of them have booked return tickets. So basically I want to match the origin city to the destination city of the immediate row and vice versa for the same PERSON and Account that will give me a data with their two way travel and then I want to calculate the days of their journey. I am trying to do this in R but I am unable to match origin with destination of the immediate row and vice versa.
I have sorted the account numbers of customers to manually see if there are any return journeys and there are quite a lot.
The data is like this:
Account number origin city Destination city Date
1 London chicago 7/22/2018
2 Milan London 7/23/2018
2 London Milan 7/28/2018
1 chicago london 8/22/2018
Upvotes: 1
Views: 46
Reputation: 66520
Another option would be to join to itself, with the fields reversed.
Edit: Added "trip_num" to deal better with repeated trips by the same person.
library(dplyr)
# First, convert date field to Date type
df <- df %>%
mutate(Date = lubridate::mdy(Date)) %>%
# update with M-M's suggestion in comments
mutate_at(.vars = vars(origin_city, Destination_city), .funs = toupper) %>%
# EDIT: adding trip_num to protect against extraneous joins for repeat trips
group_by(Account_number, origin_city, Destination_city) %>%
mutate(trip_num = row_number()) %>%
ungroup()
df2 <- df %>%
left_join(df, by = c("Account_number", "trip_num",
"origin_city" = "Destination_city",
"Destination_city" = "origin_city")) %>%
mutate(days = (Date.x - Date.y)/lubridate::ddays(1))
> df2
# A tibble: 6 x 7
Account_number origin_city Destination_city Date.x trip_num Date.y days
<int> <chr> <chr> <date> <int> <date> <dbl>
1 1 LONDON CHICAGO 2018-07-22 1 2018-08-22 -31
2 2 MILAN LONDON 2018-07-23 1 2018-07-28 -5
3 2 LONDON MILAN 2018-07-28 1 2018-07-23 5
4 1 CHICAGO LONDON 2018-08-22 1 2018-07-22 31
5 2 MILAN LONDON 2018-08-23 2 2018-08-28 -5
6 2 LONDON MILAN 2018-08-28 2 2018-08-23 5
Data: (added repeat trip by Account_number 2)
df <- read.table(
header = T,
stringsAsFactors = F,
text = "Account_number origin_city Destination_city Date
1 London chicago 7/22/2018
2 Milan London 7/23/2018
2 London Milan 7/28/2018
1 chicago london 8/22/2018
2 Milan London 8/23/2018
2 London Milan 8/28/2018")
Upvotes: 2