Reputation: 690
I have a CSV file with a list of posts from an online discussion forum. I have the timestamp for each post in this format: YYYY-MM-DD hh:mm:ss.
I want to calculate how often a new post is submitted, as in "X posts per second". I think what I need is just the mean, median and sd for the rate of posting (posts per second). I just loaded the CSV:
d <- read.csv("posts.csv")
colnames(d) <- c("post.id", "timestamp")
Upvotes: 1
Views: 1552
Reputation: 226771
Something like:
tt <- table(cut(as.POSIXlt(d$timestamp),"1 sec"))
c(mean(tt),median(tt),sd(tt))
You didn't provide a reproducible example so I'm not 100% sure this works, but something like that ... also don't know how well it will scale to giant data sets.
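For illustration, here is a small self-contained sketch of the same idea on a made-up data frame. The timestamps are invented, and the explicit format and tz = "UTC" arguments are my assumptions about the file. It shows that cut() creates a level for every second between the first and last post, so seconds with no posts are counted as zeros:

```r
# Hypothetical stand-in for the CSV: five posts over a five-second span
d <- data.frame(
  post.id   = 1:5,
  timestamp = c("2010-09-01 12:00:00", "2010-09-01 12:00:00",
                "2010-09-01 12:00:01", "2010-09-01 12:00:04",
                "2010-09-01 12:00:04"),
  stringsAsFactors = FALSE
)

# Parse the strings explicitly; tz = "UTC" is an assumption, adjust to your data
times <- as.POSIXct(d$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# One factor level per second from the first to the last post,
# including the seconds in which nothing was posted
tt <- table(cut(times, "1 sec"))
counts <- as.vector(tt)

stats <- c(mean = mean(counts), median = median(counts), sd = sd(counts))
print(counts)
print(stats)
```

Whether the empty seconds should count toward the statistics depends on what you mean by "rate"; drop the zeros first if you only care about seconds with activity.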
More detail (with example):
set.seed(1001)
n <- 1e5
nt <- 1e5
z <- seq(as.POSIXct("2010-09-01"),length=nt,by="1 sec")
length(z)
z2 <- sample(z,size=n,replace=TRUE)
tt <- table(cut(z2,"1 sec"))
c(mean(tt),median(tt),sd(tt))
This tiny example suggests that the cut() command might be slow. Play with the 'nt' (number of seconds in the time interval from beginning to end) and 'n' (number of samples) parameters to get a sense of how long your problem will take.
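If cut() does turn out to be the bottleneck, one possible alternative is to convert the timestamps to whole-second offsets from the first post and count them with tabulate(). This is only a sketch, not benchmarked on real data, and it assumes the timestamps have one-second resolution. The variable names secs, counts_fast and counts_cut are mine:

```r
set.seed(1001)
n  <- 1e4   # number of posts (kept small so the comparison runs quickly)
nt <- 1e4   # number of seconds in the interval
z  <- seq(as.POSIXct("2010-09-01", tz = "UTC"), length.out = nt, by = "1 sec")
z2 <- sample(z, size = n, replace = TRUE)

# Whole-second offset of each post from the earliest one, shifted to start at 1
secs <- as.integer(round(as.numeric(z2) - min(as.numeric(z2)))) + 1L

# tabulate() counts occurrences of 1..nbins, so empty seconds show up as zeros
counts_fast <- tabulate(secs, nbins = max(secs))

# Same counts via the cut() approach, for comparison
counts_cut <- as.vector(table(cut(z2, "1 sec")))

print(c(mean(counts_fast), median(counts_fast), sd(counts_fast)))
```

Wrapping each approach in system.time() on your own data should show whether the difference matters at your scale.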
Upvotes: 2
Reputation: 263461
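The posting rate at any given post is just 1/(interval since the previous post), so make a vector of diff(times) and then take mean(1/as.numeric(diff(times), units="secs")). Passing units="secs" keeps the intervals in seconds, since diff() on date-times otherwise picks its units automatically. Note that this is the mean of the per-interval rates; the overall rate (posts divided by elapsed time) is 1/mean(interval), and the two can differ considerably when the gaps vary.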
> posts <- data.frame(ids = paste(letters[sample(1:26, 100, replace=TRUE)],
+                                 sample(1:100)),
+                     time = Sys.time() + cumsum(abs(rnorm(100))*100))
> mean( 1/as.numeric(diff(posts$time)) )
[1] 0.03545346
Edit: I thought that by using cumsum I would get the time series ordered, but that was not the case, so it's amended to take abs(rnorm(100) ).
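One caveat worth illustrating: the mean of the per-interval rates is dominated by the short gaps, so it can be far from the overall rate (intervals divided by elapsed time). A small sketch on made-up timestamps, with hypothetical variable names, using units = "secs" so the difftime units are not chosen automatically:

```r
# Hypothetical posting times: gaps of 1, 1, 1 and 97 seconds
times <- as.POSIXct("2010-09-01 12:00:00", tz = "UTC") + c(0, 1, 2, 3, 100)

# Gaps between consecutive posts, forced to seconds
gaps <- as.numeric(diff(times), units = "secs")

# Mean of the per-interval rates
mean_of_rates <- mean(1 / gaps)

# Overall rate: number of intervals divided by total elapsed seconds
overall_rate <- length(gaps) / sum(gaps)

print(c(mean_of_rates = mean_of_rates, overall_rate = overall_rate))
```

Also note that two posts with the same timestamp give a gap of 0 and hence a rate of Inf, so identical timestamps need to be handled before averaging.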
Upvotes: 3
Reputation: 1005
I don't know your programming language, but if you can convert the timestamps to milliseconds, just subtract the lowest timestamp from the highest, divide by 1000 to get the elapsed time in seconds, and then divide the number of posts (rows in posts.csv) by that figure; you're left with posts per second. If you can get the timestamps in seconds, it's the same except you don't divide by 1000.
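In R (the language used in the question), that calculation might look like the sketch below. The timestamps are made up, the function name posts_per_second is hypothetical, and tz = "UTC" is an assumption:

```r
# Hypothetical helper: overall posting rate = number of posts / elapsed seconds
posts_per_second <- function(timestamps, tz = "UTC") {
  tm <- as.POSIXct(timestamps, format = "%Y-%m-%d %H:%M:%S", tz = tz)
  elapsed <- as.numeric(difftime(max(tm), min(tm), units = "secs"))
  length(tm) / elapsed
}

# Made-up example: 4 posts spread over 6 seconds
ts <- c("2010-09-01 12:00:00", "2010-09-01 12:00:02",
        "2010-09-01 12:00:03", "2010-09-01 12:00:06")
rate <- posts_per_second(ts)
print(rate)
```

This gives a single overall figure; it says nothing about the median or spread of the rate.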
Upvotes: 0