r - Adding a row index based on a combination of multiple columns in a large dataframe

Question

I have a relatively large data frame spanning 5.5M observations containing 2 variables, here is a sample of the data. It contains information about days when a user generated a pageview.

 timestamp     user_id
2013-11-07  ff268cef0c29 
2013-11-02  12bb7af7a842 
2013-11-30  e45abb10ae0b 
2013-11-06  e45abb10ae0b
2013-11-25  f266f8c9580e

Note, that the data frame is ordered by first the user_id and then the timestamp, through

kstaord<-kstaord[order(kstaord$timestamp,kstaord$user_id),]

My goal is to create an index column, that would index all page views of a user based on the date of their occurrence, which should result in something like this:

 timestamp     user_id     index
2013-11-07  ff268cef0c29     1 
2013-11-02  12bb7af7a842     1 
2013-11-30  e45abb10ae0b     1 
2013-11-06  e45abb10ae0b     2
2013-11-25  f266f8c9580e     1

So far I have tried the for loop:

for (j in 2:nrow(kstaord)) {
if (kstaord$user_id[j]!=kstaord$user_id[j-1]) {kstaord$index[j]<-1}
else {kstaord$index[j]<-kstaord$index[j-1]+1}
if ((j %% 100000)==0) {print(".")}
}

Which takes a lot of time, yet creates inconsistent results. I have found a stack overflow thread here which deals with a similar issues. I have tried the following suggestion this thread, which I am unsure of (my search for the ave function says it produces average values):

index <- ave( 1:nrow(kstaord), kstaord$user_id, factor( kstaord$timestamp),          
FUN=function(x) 1:length(x) )

I was however unable to let this function run its full course (I am working on a server installation of R and currently I have a bad connection, so I can not be sure this doesn't happen due to my connection problems). The original function also had the secondary dimension defined as a factor, while I am using a date variable so this might be an issue as well.

I am looking for a solution that would give consistent results and would run without problems on a server where I can use almost 24 GB RAM at most.

Thanks in advance for any help.

agstudy · Accepted Answer

Use a data.table:

library(data.table)
DT <- as.data.table(dat)
DT[, index := seq_len(.N), by = user_id]
   timestamp      user_id index
1: 2013-11-07 ff268cef0c29     1
2: 2013-11-02 12bb7af7a842     1
3: 2013-11-30 e45abb10ae0b     1
4: 2013-11-06 e45abb10ae0b     2
5: 2013-11-25 f266f8c9580e     1

r - Adding a row index based on a combination of multiple columns in a large dataframe

Answers (2)

Related Questions