Reputation: 35
I have a relatively large data frame spanning 5.5M observations containing 2 variables, here is a sample of the data. It contains information about days when a user generated a pageview.
timestamp user_id
2013-11-07 ff268cef0c29
2013-11-02 12bb7af7a842
2013-11-30 e45abb10ae0b
2013-11-06 e45abb10ae0b
2013-11-25 f266f8c9580e
Note, that the data frame is ordered by first the user_id and then the timestamp, through
kstaord<-kstaord[order(kstaord$timestamp,kstaord$user_id),]
My goal is to create an index column, that would index all page views of a user based on the date of their occurrence, which should result in something like this:
timestamp user_id index
2013-11-07 ff268cef0c29 1
2013-11-02 12bb7af7a842 1
2013-11-30 e45abb10ae0b 1
2013-11-06 e45abb10ae0b 2
2013-11-25 f266f8c9580e 1
So far I have tried the for loop:
for (j in 2:nrow(kstaord)) {
if (kstaord$user_id[j]!=kstaord$user_id[j-1]) {kstaord$index[j]<-1}
else {kstaord$index[j]<-kstaord$index[j-1]+1}
if ((j %% 100000)==0) {print(".")}
}
Which takes a lot of time, yet creates inconsistent results. I have found a stack overflow thread here which deals with a similar issues. I have tried the following suggestion this thread, which I am unsure of (my search for the ave function says it produces average values):
index <- ave( 1:nrow(kstaord), kstaord$user_id, factor( kstaord$timestamp),
FUN=function(x) 1:length(x) )
I was however unable to let this function run its full course (I am working on a server installation of R and currently I have a bad connection, so I can not be sure this doesn't happen due to my connection problems). The original function also had the secondary dimension defined as a factor, while I am using a date variable so this might be an issue as well.
I am looking for a solution that would give consistent results and would run without problems on a server where I can use almost 24 GB RAM at most.
Thanks in advance for any help.
Upvotes: 2
Views: 2423
Reputation: 67778
An ave
alternative. If I understood your question correctly, you don't need 'timestamp' as a grouping variable. It should be enough with 'user_id'.
df$index <- with(df, ave(user_id, user_id, FUN = seq_along))
df
# timestamp user_id index
# 1 2013-11-07 ff268cef0c29 1
# 2 2013-11-02 12bb7af7a842 1
# 3 2013-11-30 e45abb10ae0b 1
# 4 2013-11-06 e45abb10ae0b 2
# 5 2013-11-25 f266f8c9580e 1
And yes, you are correct that the default function in ave
"produces average values" (FUN = mean
). But you may set that FUN
ction to whatever function that suits your needs.
On larger data sets, data.table
alternatives are most likely faster (such as the answer provided by @agstudy).
Upvotes: 1
Reputation: 121568
Use a data.table
:
library(data.table)
DT <- as.data.table(dat)
DT[, index := seq_len(.N), by = user_id]
timestamp user_id index
1: 2013-11-07 ff268cef0c29 1
2: 2013-11-02 12bb7af7a842 1
3: 2013-11-30 e45abb10ae0b 1
4: 2013-11-06 e45abb10ae0b 2
5: 2013-11-25 f266f8c9580e 1
Upvotes: 2