Reputation: 39
I have a problem that I encounter regularly, and I need a more efficient way of dealing with. I have a messy solution that is posted below.
First, I'll generate some example data that is similar to my datasets.
a <- c(1, 2, 2, 2, 3, 3)
b <- c("10/12", "10/12", "10/12", "10/13", "10/12", "10/12")
c <- c("c", "c", "pv", "c", "c", "c")
data <- matrix(NA, nrow = 6, ncol = 3)
data[,1] <- a
data[,2] <- b
data[,3] <- c
data
[,1] [,2] [,3]
[1,] 1 10/12 c
[2,] 2 10/12 c
[3,] 2 10/12 pv
[4,] 2 10/13 c
[5,] 3 10/12 c
[6,] 3 10/12 c
# [,1] is a unique identifier, [,2] is a date, and [,3] is a type of occurrance
What I need to do is generate a table that includes only one entry for each ID for each day with a column showing whether that entry corresponds to 'c' only, 'pv' only, 'c & pv', or 'multiple c'. Multiple pvs are not possible in the data
The way I have done this is using a nested for loop:
# I generate an object to post the data to
output.temp <- matrix(NA, nrow = 1, ncol = 4)
# Then I define the outer loop that subsets the data over each ID
ids <- unique(data[,1])
n.ids <- length(ids)
for(i in 1:n.ids){
temp.data <- subset(data, data[,1] == ids[i])
dates <- unique(temp.data[,2])
n.dates <- length(dates)
# Then I define the inner loop that subsets the data for each ID over each date
for(j in 1: n.dates){
date.data <- subset(temp.data, temp.data[,2] == dates[j])
# Then I apply the logic of what to write out
if(nrow(date.data) == 1){
if(date.data[,3] == 'c'){
new.row <- cbind(date.data, "c only")
output.temp <- rbind(output.temp, new.row)
}
if(date.data[,3] == 'pv'){
new.row <- cbind(date.data, "pv only")
output.temp <- rbind(output.temp, new.row)
}
}
if(nrow(date.data) > 1){
if('pv' %in% date.data[,3]){
new.row <- cbind(matrix(date.data[1,], nrow = 1), c("c & pv"))
output.temp <- rbind(output.temp, new.row)
}
else{
new.row <- cbind(matrix(date.data[1,], nrow = 1), " multiple c only")
output.temp <- rbind(output.temp, new.row)
}
}
}
}
# Finally, I drop the unnecessary row and column from the output object
output.final <- output.temp[-1,-3]
This works, but it is terribly inefficient. As my datasets become larger (approaching 1 million rows), it becomes more and more of a problem.
Since I am really new to R and have little experience with programming, any advice on an alternate strategy would be greatly appreciated.
Upvotes: 3
Views: 103
Reputation: 12905
You should be able to use the code below to get to the exact format of output you require.
dataset <- data.table(dataset)
setnames(dataset, c('id','day','occurrence'))
dataset[,list(noofc = table(occurrence)['c'], noofpv = table(occurrence)['pv']), by = c('id','day')]
data.table
s are very efficient data frames and should help with your data size problem as well
Upvotes: 1
Reputation: 9405
I think this ddply()
solution should work for you:
library(plyr)
data <- data.frame(data)
names(data) <- c("id","date","type")
get.type <- function(x) ifelse("c" %in% x & "pv" %in% x, "c & pv",
ifelse(sum("c" == x) > 1,"multiple c",
ifelse("c" %in% x,"c",
ifelse("pv" %in% x,"pv","other"))))
ddply(data,.(id,date),summarize,type=get.type(type))
id date type
1 1 10/12 c
2 2 10/12 c & pv
3 2 10/13 c
4 3 10/12 multiple c
Upvotes: 0