Gopal Sharma
Gopal Sharma

Reputation: 9

Looking to combine multiple rows of data based on common column data

I'm looking at trying to merge two consecutive rows of data using common column(s) data. In essence I'm trying to go from

 UserID Geography   Login  Logout

   user1  East      0:00:22    -       
   user1  East         -     0:01:29
   user2  West      0:03:57    -    
   user2  West         -     0:48:10
   user3  South     0:59:25    -    
   user3  South        -     1:08:21

to

  UserID Geography   Login  Logout

   user1  East      0:00:22  0:01:29       
   user2  West      0:03:57  0:48:10    
   user3  South     0:59:25  1:08:21   

My apologies in advance for the formatting. I wanted to mention that there are multiple rows like this containing user1,user2 etc's data so an aggregation function such as MAX or MIN would not work. Also the solution I'm looking for is in R but any other language would also be most welcome.

Thanks in advance, Gopal

Upvotes: 1

Views: 63

Answers (1)

jdobres
jdobres

Reputation: 11957

This can be accomplished with the dplyr and tidyr packages. In essence, we gather the login and logout times into a single column, remove empty values, and re-spread login and logout events into their own columns.

df1 <- read.table(text = 'UserID Geography   Login  Logout

              user1  East      0:00:22    -       
              user1  East         -     0:01:29
              user2  West      0:03:57    -    
              user2  West         -     0:48:10
              user3  South     0:59:25    -    
              user3  South        -     1:08:21', header = T)

  UserID Geography   Login  Logout
1  user1      East 0:00:22       -
2  user1      East       - 0:01:29
3  user2      West 0:03:57       -
4  user2      West       - 0:48:10
5  user3     South 0:59:25       -
6  user3     South       - 1:08:21

library(dplyr)
library(tidyr)
df2 <- df1 %>% 
  gather(action, time, -UserID, -Geography) %>% 
  filter(time != '-') %>% 
  spread(action, time)

  UserID Geography   Login  Logout
1  user1      East 0:00:22 0:01:29
2  user2      West 0:03:57 0:48:10
3  user3     South 0:59:25 1:08:21

Handling multiple sessions

In the OP's original data set, multiple logins per user are possible:

df <- read.table(text = 'UserID Geography   EventType ChannelType   Time 
user4   South   Log-in  Web 0:00:10 
user1   East    Log-in  Web 0:00:22 
user4   South   Log-out Mobile  0:00:44 
user1   East    Log-out Web 0:01:29 
user5   East    Log-in  Web 0:02:01 
user1   East    Log-in  Mobile 0:03:57 
user16  South   Log-in  Mobile  0:04:36 
user15  North   Log-in  Mobile  0:05:42 
user3   North   Log-in  Web 0:05:59 
user8   South   Log-in  Mobile  0:07:09 
user19  North   Log-in  Mobile  0:09:22 
user11  North   Log-in  Web 0:12:39 
user8   South   Log-out Web 0:18:32 
user8   South   Log-in  Web 0:19:35', header = T, stringsAsFactors = F)

The key is to use dplyr to group logins and logout per user, then number these. With each login/logout pairing now uniquely identified, the data can be reformatted:

df2 <- df %>% 
  arrange(UserID, Time) %>% 
  group_by(UserID, EventType) %>% 
  mutate(EventNum = 1:n()) %>% 
  select(-ChannelType) %>% 
  spread(EventType, Time, fill = '-') %>% 
  arrange(`Log-in`)

   UserID Geography EventNum `Log-in` `Log-out`
    <chr>     <chr>    <int>    <chr>     <chr>
 1  user4     South        1  0:00:10   0:00:44
 2  user1      East        1  0:00:22   0:01:29
 3  user5      East        1  0:02:01         -
 4  user1      East        2  0:03:57         -
 5 user16     South        1  0:04:36         -
 6 user15     North        1  0:05:42         -
 7  user3     North        1  0:05:59         -
 8  user8     South        1  0:07:09   0:18:32
 9 user19     North        1  0:09:22         -
10 user11     North        1  0:12:39         -
11  user8     South        2  0:19:35         -

Upvotes: 1

Related Questions