Reputation: 6649
I have a dataframe of items with a certain number of different events which occur at different times. e.g. say I had a times of events (goal, corner, red card etc...) in various games of football. I want to count the number of each events which occurred before a certain time for each team in each game (where the time is different for each game).
So I could have a dataframe of events (where C is corner, G is goal and R is red card) as follows:
events <- data.frame(
game_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
team = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1),
event_id= c('C', 'C', 'C', 'G', 'C', 'R', 'C', 'C', 'C', 'C', 'G', 'G', 'C'),
time = c(5, 14, 27, 67, 78, 87, 10, 19, 33, 45, 60, 78, 89))
and another dataframe of times to look up for each event as follows:
eventTime <- data.frame(
game_id = c(1, 2),
time = c(45, 65))
So for game 1 I want to count the number of each event for each team before the 45th minute, and for game 2 I would want to do the same thing but for the 60th minute so return something like:
game_id time t1_C t1_G t1_R t2_C t2_G t2_R
1 45 2 0 0 1 0 0
2 65 2 0 0 2 1 0
Since in game 1 team1 had 2 corners, 0 goals and 0 red cards before the 45th minute whilst team 2 had 1 corner, 0 goals and 0 red cards.
I have been doing this by using apply to go through and subset the data I am after and counting up the rows, however I have 1000's of rows and this takes a lot of time.
Does anyone know of the quickest way of doing this?
EDIT: I failed to mention that any game_id may appear multiple times with different times in the eventTime dataframe. E.g. game_id could appear twice with times 45 and 70, I would want to get the appropriate counts for each unique event/time combination.
Upvotes: 1
Views: 970
Reputation: 6649
Thanks to both of you, I think both of your answers would have answered my initial question, but wouldn't quite work for the editted question. However I have combined parts of both of your answers to get something which works for me.
I used the first part of Ben Bolkers answer by merging the data frames and subsetting where time less than stopTime. Then converted to data table and used the last two lines of Coderemifa's answer. So somethign as follows
library(reshape)
library(reshape2)
library(plyr)
names(eventTime)[2] <- "stopTime"
events <- merge(events,eventTime)
e2 <- subset(events,time<stopTime)
eventsSubset <- data.table(e2)
eventsSubset <- eventsSubset[,list(Freq = .N), by=c('team','event_id','game_id','stopTime')]
eventsReshaped <- cast(eventsSubset, game_id + stopTime~ event_id+team, fun.aggregate = sum, value = "Freq")
Upvotes: 1
Reputation: 12905
Run this step by step to understand, this should give you the intended result, I think. Also, there is room for reducing the code if you're up for it -
library(data.table)
library(reshape)
library(reshape2)
events <- data.table(events)
eventTime <- data.table(eventTime)
eventTime[,TimeLimit := time]
setkeyv(eventTime,c('game_id','time'))
setkeyv(events,c('game_id','time'))
eventsSubset <- eventTime[events, roll = -Inf][!is.na(TimeLimit)]
eventsSubset <- eventsSubset[,list(Freq = .N), by = c('team','event_id','game_id','TimeLimit')]
eventsReshaped <- cast(eventsSubset, game_id + TimeLimit ~ event_id+team, fun.aggregate = sum, value = "Freq")
Output
> eventsReshaped
game_id TimeLimit C_1 C_2 G_2
1 1 45 2 1 0
2 2 65 2 2 1
PS- This assumes that in your whole dataset, each type of event will occur at least once. The output of this specific code aggregates over only the events found which is why the result doesn't have all event-team combinations. You can add dummy entries to your original dataset, if you want to make sure this doesn't happen.
Upvotes: 2
Reputation: 226801
It will help to rename the event time to something different from "time":
names(eventTime)[2] <- "stopTime"
Merge stopping time into main data set:
events <- merge(events,eventTime)
Load useful packages:
library(reshape2)
library(plyr)
Subset to retain the events before the stopping time:
e2 <- subset(events,time<stopTime)
Create a table of game * team * event and 'melt' it to long form:
m2 <- melt(with(e2,table(game_id,team,event_id)))
Rearrange to your preferred wide form:
m3 <- dcast(m2,game_id~team+event_id)
Put the stopping time back into the results:
merge(eventTime,m3)
Upvotes: 1