Reputation: 2047
I have the following data:
> test2
video_id created count
1 10022 2012-07-04 1
2 40001 2012-07-04 1
3 70001 2012-07-04 14
4 70029 2012-07-04 14
5 100001 2012-07-04 1
6 100008 2012-07-04 3
7 100022 2012-07-15 1
8 440003 2012-07-15 2
9 470010 2012-07-15 1
This was aggregated from a previous data set, to output the number of views $count
per date $created
for each video $video_id
, using the following code. At first, the values in the $created
column were generated using as.Date()
, but aggregate()
would output them as NULL
, so I converted them all to characters with as.character()
.
test$created <- as.character(test$created)
test["count"] <- 1
test2 <- aggregate(test[c("count")], by=list(video_id=test$video_id,
created=test$created), FUN=sum, na.rm=TRUE)
What I would like to do now is to output the video $video_id
with the maximum views $count
for each date $created
. It is not obvious to me how I can obtain this using the aggregate()
function, especially if there are more than two videos with the same value for $count
, in which case I would like to output all of them (ultimately, I would like to pick one of those "top viewed videos" at random, which makes it unsatisfactory to just settle with the first max()
that comes up).
Would it be better to separate the data into as many data frames for each date so that it will be easier to perform the analysis on each date separately?
Upvotes: 0
Views: 77
Reputation: 1479
To do the aggregation try the package data.table
library(data.table)
test<-data.table(test)
test[,.SD[which.max(count)],by=created]
To get a random one I would just shuffle the data first:
test<-test[sample(1:nrow(test)),]
Upvotes: 2
Reputation: 70296
another solution with dplyr (which is more easy to read IMO):
library(dplyr)
test %.% group_by(created) %.% filter(count %in% max(count))
of course you could use sample
afterwards
Upvotes: 0
Reputation: 6277
The use of aggregate
can indeed be tricky. In your case though, you can use the by
function,which will effectively split of your data.frame into several chunks and returns a list. For example, on your dataset:
> ttt = read.table(text='
+ video_id created count
+ 1 10022 2012-07-04 1
+ 2 40001 2012-07-04 1
+ 3 70001 2012-07-04 14
+ 4 70029 2012-07-04 14
+ 5 100001 2012-07-04 1
+ 6 100008 2012-07-04 3
+ 7 100022 2012-07-15 1
+ 8 440003 2012-07-15 2
+ 9 470010 2012-07-15 1',header=T)
> by(ttt,list(ttt$created),function(x) x$video_id[x$count==max(x$count)] )
: 2012-07-04
[1] 70001 70029
--------------------------------------------------------------------------
: 2012-07-15
[1] 440003
This gives you a list of most-watched video for each date, which is the final result you wanted. You can further use unlist
if you want the result as a vector, and finally sample
to get only one video_id
from this pool.
Upvotes: 0