Alban Couturier

Reputation: 129

Collapse and count the number of unique values

I am currently working on an application where I have a dataframe that looks like this:

Database
UserId         Hour         Date
01                18           01.01.2016
01                18           01.01.2016
01                14           02.01.2016
01                14           02.01.2016
02                21           02.01.2016
02                08           05.01.2016
02                08           05.01.2016
03                23           05.01.2016

Each line represents a session.

I need to determine whether the time of the first session of a user has an impact on the number of sessions this user is going to have.

I have tried the command summaryBy:

library(doBy)
first_hour <- summaryBy(UserId + Hour + Date ~ UserId, 
    FUN=c(head, length, unique), database)

But it doesn't give me the correct result.

My goal is to determine, for each user, the Hour of the first session, the total number of sessions, and the number of distinct session dates.

Upvotes: 3

Views: 2212

Answers (3)

nya

Reputation: 2250

Using base commands, you can write your own function to select desired information:

user.info <- function(user){
    temp <- subset(Database, Database$UserId == user)
    return(c(UserId=user, FirstHour=temp$Hour[1], Sessions=nrow(temp), Dates=length(unique(temp$Date))))
}

t(sapply(unique(Database$UserId), FUN=user.info)) 
#     UserId FirstHour Sessions Dates
# [1,]      1        18        4     2
# [2,]      2        21        3     2
# [3,]      3        23        1     1

Here, FirstHour is the hour on the first listed row for the given user, Sessions is the number of rows for the user and Dates is the number of different dates listed for the user.

The function is applied to all unique users and the final table is transposed.
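
A variant of the same idea can be sketched with split and lapply. It sorts by date before taking the first row, so it does not rely on the rows already being in order. It assumes the dates are day.month.year (the question does not say) and rebuilds the sample data so that it runs on its own; ParsedDate, sorted and summary_by_user are names made up for this example.

```r
# Sample data copied from the question.
Database <- data.frame(
  UserId = c(1, 1, 1, 1, 2, 2, 2, 3),
  Hour   = c(18, 18, 14, 14, 21, 8, 8, 23),
  Date   = c("01.01.2016", "01.01.2016", "02.01.2016", "02.01.2016",
             "02.01.2016", "05.01.2016", "05.01.2016", "05.01.2016"),
  stringsAsFactors = FALSE
)

# Assumption: dates are day.month.year; change the format string if not.
Database$ParsedDate <- as.Date(Database$Date, format = "%d.%m.%Y")

# Sort by user and date so the first row per user is the earliest session.
# Ties keep their original order.
sorted <- Database[order(Database$UserId, Database$ParsedDate), ]

# One summary row per user, then stack the rows into a single data frame.
summary_by_user <- do.call(rbind, lapply(split(sorted, sorted$UserId), function(d) {
  data.frame(
    UserId    = d$UserId[1],
    FirstHour = d$Hour[1],
    Sessions  = nrow(d),
    Dates     = length(unique(d$ParsedDate))
  )
}))

print(summary_by_user)
```

On the sample data this should agree with the table above (FirstHour 18, 21, 23).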

Upvotes: 0

akrun

Reputation: 887088

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), order the rows by 'UserId' and 'Date', and then, grouped by 'UserId', take the first 'Hour', the total number of sessions (.N), and the number of unique 'Date' values (uniqueN(Date)).

library(data.table)
setDT(df1)[order(UserId, as.Date(Date, "%m.%d.%Y")),.(Hour = Hour[1L],
      Sessions = .N, DifferSessionDate = uniqueN(Date)) , by = UserId]
#    UserId Hour Sessions DifferSessionDate
#1:      1   18        4                 2
#2:      2   21        3                 2
#3:      3   23        1                 1
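
The ordering step depends on how the Date strings are parsed, and the question does not say whether they are month.day.year or day.month.year. A small base R sketch of the difference, using only the three distinct dates from the sample (as_mdy and as_dmy are illustrative names):

```r
d <- c("01.01.2016", "02.01.2016", "05.01.2016")

# Read as month.day.year, the format string used in the code above.
as_mdy <- as.Date(d, format = "%m.%d.%Y")
# Read as day.month.year, the other plausible interpretation.
as_dmy <- as.Date(d, format = "%d.%m.%Y")

format(as_mdy)  # "2016-01-01" "2016-02-01" "2016-05-01"
format(as_dmy)  # "2016-01-01" "2016-01-02" "2016-01-05"

# For these sample dates both readings sort the same way.
identical(order(as_mdy), order(as_dmy))  # TRUE

# A day above 12 cannot be parsed as a month and yields NA.
is.na(as.Date("13.01.2016", format = "%m.%d.%Y"))  # TRUE
```

So the result for the sample is the same either way, but with real data the format string should match the actual convention, otherwise the order can change or dates can become NA.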

Upvotes: 2

David_B

Reputation: 926

You could also do this using dplyr. first(Hour) takes the hour on the first row for each user, so the data should already be sorted by date:

library(dplyr)
dt %>% group_by(UserId) %>% summarise(FirstHour = first(Hour),
                                      NumSessions = n(),
                                      NumDates = length(unique(Date)))

Source: local data frame [3 x 4]

  UserId FirstHour NumSessions NumDates
   (int)     (int)       (int)    (int)
1      1        18           4        2
2      2        21           3        2
3      3        23           1        1
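
Taking the smallest hour per user and taking the hour of the first session are different questions, and they give different answers on the question's data. A minimal base R sketch of the contrast, assuming the rows are already in chronological order (min_hour and first_hour are illustrative names):

```r
# Sample rows from the question, in the order they were listed.
UserId <- c(1, 1, 1, 1, 2, 2, 2, 3)
Hour   <- c(18, 18, 14, 14, 21, 8, 8, 23)

# Smallest hour per user: the earliest time of day the user was ever active.
min_hour <- tapply(Hour, UserId, min)

# Hour on the first listed row per user: the first session, provided the
# rows are in chronological order (an assumption about the data).
first_hour <- tapply(Hour, UserId, function(h) h[1])

# Users 1 and 2 differ: 14 vs 18 and 8 vs 21. User 3 has a single session.
print(rbind(min_hour, first_hour))
```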

Upvotes: 0
