biohazard
biohazard

Reputation: 2047

Aggregate per two columns while outputting all maximums of a third column

I have the following data:

> test2
    video_id    created count
1      10022 2012-07-04     1
2      40001 2012-07-04     1
3      70001 2012-07-04    14
4      70029 2012-07-04    14
5     100001 2012-07-04     1
6     100008 2012-07-04     3
7     100022 2012-07-15     1
8     440003 2012-07-15     2
9     470010 2012-07-15     1

How I came up with this data:

This was aggregated from a previous data set, to output the number of views $count per date $created for each video $video_id, using the following code. At first, the values in the $created column were generated using as.Date(), but aggregate() would output them as NULL, so I converted them all to characters with as.character().

test$created <- as.character(test$created)
test["count"] <- 1
test2 <- aggregate(test[c("count")], by=list(video_id=test$video_id,
                   created=test$created), FUN=sum, na.rm=TRUE)

Now what?

What I would like to do now is to output the video $video_id with the maximum views $count for each date $created. It is not obvious to me how I can obtain this using the aggregate() function, especially if there are more than two videos with the same value for $count, in which case I would like to output all of them (ultimately, I would like to pick one of those "top viewed videos" at random, which makes it unsatisfactory to just settle with the first max() that comes up).

Would it be better to separate the data into as many data frames for each date so that it will be easier to perform the analysis on each date separately?

Upvotes: 0

Views: 77

Answers (3)

ZRoss
ZRoss

Reputation: 1479

To do the aggregation try the package data.table

library(data.table)
test<-data.table(test)
test[,.SD[which.max(count)],by=created]

To get a random one I would just shuffle the data first:

test<-test[sample(1:nrow(test)),]

Upvotes: 2

talat
talat

Reputation: 70296

another solution with dplyr (which is more easy to read IMO):

library(dplyr)
test %.% group_by(created) %.% filter(count %in% max(count))

of course you could use sample afterwards

Upvotes: 0

Jealie
Jealie

Reputation: 6277

The use of aggregate can indeed be tricky. In your case though, you can use the by function,which will effectively split of your data.frame into several chunks and returns a list. For example, on your dataset:

> ttt = read.table(text='
+     video_id    created count
+ 1      10022 2012-07-04     1
+ 2      40001 2012-07-04     1
+ 3      70001 2012-07-04    14
+ 4      70029 2012-07-04    14
+ 5     100001 2012-07-04     1
+ 6     100008 2012-07-04     3
+ 7     100022 2012-07-15     1
+ 8     440003 2012-07-15     2
+ 9     470010 2012-07-15     1',header=T)
> by(ttt,list(ttt$created),function(x) x$video_id[x$count==max(x$count)] )
: 2012-07-04
[1] 70001 70029
-------------------------------------------------------------------------- 
: 2012-07-15
[1] 440003

This gives you a list of most-watched video for each date, which is the final result you wanted. You can further use unlist if you want the result as a vector, and finally sample to get only one video_id from this pool.

Upvotes: 0

Related Questions