Reputation: 1573
I would like to summarize the following sample data into a new data frame with these columns:
Population, Sample Size (N), Percent Completed (%)
Sample Size is the count of all records for each population; I can get this with table or tapply. Percent Completed is the percentage of records that have an End_Date (records without an End_Date are assumed not to have completed). This is where I am lost!
Sample Data
sample <- structure(list(Population = structure(c(1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L), .Label = c("Glommen",
"Kaseberga", "Steninge"), class = "factor"), Start_Date = structure(c(16032,
16032, 16032, 16032, 16032, 16036, 16036, 16036, 16037, 16038,
16038, 16039, 16039, 16039, 16039, 16039, 16039, 16041, 16041,
16041, 16041, 16041, 16041, 16044, 16044, 16045, 16045, 16045,
16045, 16048, 16048, 16048, 16048, 16048, 16048), class = "Date"),
End_Date = structure(c(NA, 16037, NA, NA, 16036, 16043, 16040,
16041, 16042, 16042, 16042, 16043, 16043, 16043, 16043, 16043,
16043, 16045, 16045, 16045, 16045, 16045, NA, 16048, 16048,
16049, 16049, NA, NA, 16052, 16052, 16052, 16052, 16052,
16052), class = "Date")), .Names = c("Population", "Start_Date",
"End_Date"), row.names = c(NA, 35L), class = "data.frame")
Upvotes: 0
Reputation: 81693
It's easy with the plyr package:
library(plyr)
ddply(sample, .(Population), summarize,
      Sample_Size = length(End_Date),
      Percent_Completed = mean(!is.na(End_Date)) * 100)
#   Population Sample_Size Percent_Completed
# 1    Glommen          13          69.23077
# 2  Kaseberga           7         100.00000
# 3   Steninge          15          86.66667
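The percentage works because !is.na() returns a logical vector, and mean() counts TRUE as 1 and FALSE as 0, so the mean is the share of non-missing values. A minimal sketch on a made-up vector (end_dates and its values are hypothetical, not taken from the question's data):

```r
# Hypothetical end dates: two recorded, two missing
end_dates <- as.Date(c(NA, "2013-11-28", NA, "2013-11-27"))

# TRUE where an end date exists, FALSE where it is NA
completed <- !is.na(end_dates)

# Mean of a logical vector is the proportion of TRUE values;
# multiply by 100 to express it as a percentage
pct_completed <- mean(completed) * 100
pct_completed
```

Here two of the four records have an end date, so the result is 50.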
Upvotes: 2
Reputation: 44330
You can do this with split/apply/combine:
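# Split the data frame into one piece per population
spl = split(sample, sample$Population)
# Build a one-row summary for each piece.
# Note: PctComplete is a proportion in [0, 1]; multiply by 100 for a percentage.
new.rows = lapply(spl, function(x) data.frame(Population=x$Population[1],
                                              SampleSize=nrow(x),
                                              PctComplete=sum(!is.na(x$End_Date))/nrow(x)))
# Stack the one-row summaries into a single data frame
combined = do.call(rbind, new.rows)
combined
#           Population SampleSize PctComplete
# Glommen      Glommen         13   0.6923077
# Kaseberga  Kaseberga          7   1.0000000
# Steninge    Steninge         15   0.8666667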
One word of warning: sample is the name of a base function, so you should pick a different name for your data frame.
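As a rough sketch of how that could look in base R only, the summary below uses tapply (which the question mentions) on a data frame with a non-clashing name. The name trials and the toy values are assumptions for illustration, not the question's actual data:

```r
# Hypothetical toy data; `trials` is an illustrative name that
# does not mask base::sample()
trials <- data.frame(
  Population = factor(c("A", "A", "A", "B", "B")),
  End_Date   = as.Date(c("2013-11-28", NA, "2013-11-30",
                         "2013-12-01", "2013-12-02"))
)

# Number of records per population
n <- tapply(trials$End_Date, trials$Population, length)

# Percentage of records per population with a non-missing End_Date
pct <- tapply(!is.na(trials$End_Date), trials$Population, mean) * 100

# Assemble the per-population results into one data frame
summary_df <- data.frame(
  Population        = names(n),
  Sample_Size       = as.vector(n),
  Percent_Completed = as.vector(pct)
)
summary_df
```

Both tapply calls group by the same factor, so their results come back in the same level order and can be combined column-wise.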
Upvotes: 2