Reputation: 57
I have this data frame with records of people coming to or leaving the country. 'date' and 'inout' indicates whether a person has traveled in ('I') or out ('O') of the country on what date. Id = 1 travels out of the country on 2008-10-06 and returns again on 2009-04-30.
Data:
df <- data.frame( id=c(1,1,2,2,2,2,3),
date=c('2008-10-06','2009-04-30', '1999-07-25','2004-02-27','2005-06-09','2013-07-01','2010-09-07'),
inout = c('O','I','I','O','I','O','I'))
id date inout
1 1 2008-10-06 O
2 1 2009-04-30 I
3 2 1999-07-25 I
4 2 2004-02-27 O
5 2 2005-06-09 I
6 2 2013-07-01 O
7 3 2010-09-07 I
I need a new data frame like this:
id start end destination
1 1 1900-01-01 2008-10-06 home
2 1 2008-10-06 2009-04-30 abroad
3 1 2009-04-30 2017-08-01 home
4 2 1900-01-01 1999-07-25 abroad
5 2 1999-07-25 2004-02-27 home
6 2 2004-02-27 2005-06-09 abroad
7 2 2005-06-09 2013-07-01 home
8 2 2013-07-01 2017-08-01 abroad
9 3 1900-01-01 2010-09-07 abroad
10 3 2010-09-07 2017-08-01 home
Where each person's first stay starts with a default date 1900-01-01 and the last stay ends with the current date (2017-08-01). In this data frame, Id = 1 is home from 1900-01-01 to 2008-10-06 and abroad from 2008-10-06 to 2009-04-30 and home again from 2009-04-30 until 2017-08-01.
Can anyone help me with that. Preferably using the dplyr package if necessary. Best regards
Upvotes: 2
Views: 448
Reputation: 32548
Base R. Pretty messy. Seems to work.
do.call(rbind, lapply(split(df, df$id), function(a) {
cbind(id = rep(a$id, length.out = NROW(a)+1),
setNames(object = data.frame(do.call(
rbind, lapply(1:(NROW(a) + 1), function(i)
c("1970-01-01", as.character(a$date), "2017-08-01")[i:(i + 1)])
)),
nm = c("Start", "End")),
Destination = if (a$inout[1] == "O") {
rep(x = c("home", "abroad"),
length.out = NROW(a) + 1)
} else{
rep(x = c("abroad", "home"),
length.out = NROW(a) + 1)
})
}))
# id Start End Destination
#1.1 1 1970-01-01 2008-10-06 home
#1.2 1 2008-10-06 2009-04-30 abroad
#1.3 1 2009-04-30 2017-08-01 home
#2.1 2 1970-01-01 1999-07-25 abroad
#2.2 2 1999-07-25 2004-02-27 home
#2.3 2 2004-02-27 2005-06-09 abroad
#2.4 2 2005-06-09 2013-07-01 home
#2.5 2 2013-07-01 2017-08-01 abroad
#3.1 3 1970-01-01 2010-09-07 abroad
#3.2 3 2010-09-07 2017-08-01 home
Upvotes: 0
Reputation: 2826
Here's my solution. It assumes that df
has two consecutive rows for each trip, and therefore there is an even number of rows in the dataset (it will not work otherwise). In order to test this, then, I added a new row to df
, which as pasted above only had 7 rows and therefore would have the fourth trip incomplete:
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
mutate(trips = rep(seq(1, n() / 2), each = 2)) %>%
group_by(trips) %>%
spread(inout, date) %>%
mutate(start = if_else(date(I) < date(O), I, O),
end = if_else(date(I) < date(O), O, I),
destination = if_else(date(I) < date(O), 'home', 'abroad')) %>%
ungroup %>%
select(-c(trips, I, O))
## id start end destination
## <chr> <fctr> <fctr> <chr>
## 1 1 2008-10-06 2009-04-30 abroad
## 2 2 1999-07-25 2004-02-27 home
## 3 2 2005-06-09 2013-07-01 home
## 4 3 2010-09-07 2012-03-08 home
Upvotes: 2