Reputation: 1266
My data looks more or less like this (with in the column car being car ID's, so each line represents one car, and the period during it was active):
structure(list(car = c(100987, 100995, 104641, 120350, 123444, 128248, 128253, 129537, 129612, 130189),
location = structure(c(1L, 1L, 1L, 2L, 3L, 4L, 3L, 3L, 3L, 3L), .Label = c("A", "A", "B", "D", "C", "C"), class = "factor"),
start = structure(c(12784, 12784, 365, 15027, 15951, -1096, 7305, 365, 365, -731), class = "Date"),
end = structure(c(16070, 16070, 16070, 16070, 16070, 13725, 16070, 16070, 16070, 16070), class = "Date")),
.Names = c("car", "location", "start", "end"), row.names = c(NA, 10L), class = "data.frame")
I would like to obtain a dataframe as follows:
output=data.frame(location=rep(c("A","B","C"),each=2),year=rep(2000:2001,3),n_cars=10:15)
So I would like to count, per location, how many cars there are over a range of years (say 2000:2013). A car only counts if it was there on the 31st of december of that year. I thought about making for each entry a list of years during which that car was active and then make the count of list items that contain that year. But my dataframe is too big to make that an efficient approach. How would I do this best? It seems an easy thing to do but I can't figure it out.
Upvotes: 0
Views: 72
Reputation: 3622
Something like this?
library(lubridate)
library(plyr)
library(gdata)
# dat is the name of your original data set structure
yeardf <- data.frame()
for(i in 1:nrow(dat)){
if(yday(dat$end[i]) > 364)
years_active <- paste(seq(year(dat$start)[i], year(dat$end)[i]))
else
years_active <- paste(seq(year(dat$start)[i], year(dat$end)[i]-1))
car <- paste(rep(dat$car[i], length(years_active)))
location <- as.character(rep(dat$location[i], length(years_active)))
df <- data.frame(car = car, location = location, year = years_active)
yeardf <- rbind(yeardf, df)
}
output <- ddply(yeardf, .(location, year), 'nrow')
output <- rename.vars(output, 'nrow', 'n_cars')
Upvotes: 1