Reputation: 9
I'm looking at trying to merge two consecutive rows of data using common column(s) data. In essence I'm trying to go from
UserID Geography Login Logout
user1 East 0:00:22 -
user1 East - 0:01:29
user2 West 0:03:57 -
user2 West - 0:48:10
user3 South 0:59:25 -
user3 South - 1:08:21
to
UserID Geography Login Logout
user1 East 0:00:22 0:01:29
user2 West 0:03:57 0:48:10
user3 South 0:59:25 1:08:21
My apologies in advance for the formatting. I wanted to mention that there are multiple rows like this containing user1,user2 etc's data so an aggregation function such as MAX or MIN would not work. Also the solution I'm looking for is in R but any other language would also be most welcome.
Thanks in advance, Gopal
Upvotes: 1
Views: 63
Reputation: 11957
This can be accomplished with the dplyr and tidyr packages. In essence, we gather the login and logout times into a single column, remove empty values, and re-spread login and logout events into their own columns.
df1 <- read.table(text = 'UserID Geography Login Logout
user1 East 0:00:22 -
user1 East - 0:01:29
user2 West 0:03:57 -
user2 West - 0:48:10
user3 South 0:59:25 -
user3 South - 1:08:21', header = T)
UserID Geography Login Logout
1 user1 East 0:00:22 -
2 user1 East - 0:01:29
3 user2 West 0:03:57 -
4 user2 West - 0:48:10
5 user3 South 0:59:25 -
6 user3 South - 1:08:21
library(dplyr)
library(tidyr)
df2 <- df1 %>%
gather(action, time, -UserID, -Geography) %>%
filter(time != '-') %>%
spread(action, time)
UserID Geography Login Logout
1 user1 East 0:00:22 0:01:29
2 user2 West 0:03:57 0:48:10
3 user3 South 0:59:25 1:08:21
In the OP's original data set, multiple logins per user are possible:
df <- read.table(text = 'UserID Geography EventType ChannelType Time
user4 South Log-in Web 0:00:10
user1 East Log-in Web 0:00:22
user4 South Log-out Mobile 0:00:44
user1 East Log-out Web 0:01:29
user5 East Log-in Web 0:02:01
user1 East Log-in Mobile 0:03:57
user16 South Log-in Mobile 0:04:36
user15 North Log-in Mobile 0:05:42
user3 North Log-in Web 0:05:59
user8 South Log-in Mobile 0:07:09
user19 North Log-in Mobile 0:09:22
user11 North Log-in Web 0:12:39
user8 South Log-out Web 0:18:32
user8 South Log-in Web 0:19:35', header = T, stringsAsFactors = F)
The key is to use dplyr to group logins and logout per user, then number these. With each login/logout pairing now uniquely identified, the data can be reformatted:
df2 <- df %>%
arrange(UserID, Time) %>%
group_by(UserID, EventType) %>%
mutate(EventNum = 1:n()) %>%
select(-ChannelType) %>%
spread(EventType, Time, fill = '-') %>%
arrange(`Log-in`)
UserID Geography EventNum `Log-in` `Log-out`
<chr> <chr> <int> <chr> <chr>
1 user4 South 1 0:00:10 0:00:44
2 user1 East 1 0:00:22 0:01:29
3 user5 East 1 0:02:01 -
4 user1 East 2 0:03:57 -
5 user16 South 1 0:04:36 -
6 user15 North 1 0:05:42 -
7 user3 North 1 0:05:59 -
8 user8 South 1 0:07:09 0:18:32
9 user19 North 1 0:09:22 -
10 user11 North 1 0:12:39 -
11 user8 South 2 0:19:35 -
Upvotes: 1