Reputation: 41
I have a dataset with three columns; the two I care about are user_type and lag (response time lag, in days):
user_type imp_date lag
Consumer 20130613 1
Consumer 20130612 2
Consumer 20130611 3
Consumer 20130612 3
Producer 20130610 10
Producer 20130614 5
Producer 20130613 7
I would like to calculate the percentage breakdown of lag for EACH user_type. Here is an example of the output I would like:
user_type lag percentage
--------- --- ----------
Consumer 1 0.25
Consumer 2 0.25
Consumer 3 0.5
Producer 5 0.333
Producer 7 0.333
Producer 10 0.333
The percentage breakdown of lag time response is calculated with respect to the total of each user_type group.
Specifically, I would like to use ddply from plyr, and I have something along these lines:
a = ddply(data, .(user_type), summarize, table(lag)/length(lag))
but it's not giving me the lag time response column.
p.s. My original motivation was to plot the lag distribution for each user type, and I have:
p <- ggplot(data, aes(x = lag, fill = factor(user_type)))
p + geom_bar(aes(y = (..count..)/sum(..count..)))
but the percentage breakdown of lag for each user_type seems to be incorrect (i.e. the percentage was calculated with respect to each lag group, not each user_type group). As a result, I decided to transform my dataset before plotting. If there is an easier way, please share.
Thanks!
Upvotes: 4
Views: 6184
Reputation: 78590
This could be done using ddply with:
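library(plyr)
# For each user_type, tabulate lag and divide by the number of rows in that group
a = ddply(data, .(user_type), function(d) {
  data.frame(table(d$lag)/length(d$lag))  # columns come out as Var1 (lag) and Freq (proportion)
})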
Though I would probably use the data.table package, like so:
library(data.table)
d = data.table(data)
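# table() orders its counts by sorted lag value, so sort the unique lags to keep both columns aligned
a = d[, list(lag=sort(unique(lag)), percentage=as.numeric(table(lag)/length(lag))), by="user_type"]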
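If you would rather avoid extra packages, here is a minimal base R sketch of the same per-group breakdown. It assumes the sample data from the question is rebuilt by hand as a data frame named data, and it leans on prop.table with margin = 1 to divide each user_type row by its own total; treat it as one possible approach rather than the definitive one.

```r
# Sample data copied from the question (rebuilt by hand for this sketch)
data <- data.frame(
  user_type = c(rep("Consumer", 4), rep("Producer", 3)),
  imp_date  = c(20130613, 20130612, 20130611, 20130612,
                20130610, 20130614, 20130613),
  lag       = c(1, 2, 3, 3, 10, 5, 7)
)

# Count each (user_type, lag) pair, then divide every row by its row total
counts <- table(data$user_type, data$lag)
props  <- prop.table(counts, margin = 1)

# Convert to long format and name the columns as in the desired output
a <- as.data.frame(props, stringsAsFactors = FALSE)
names(a) <- c("user_type", "lag", "percentage")
a$lag <- as.numeric(a$lag)

# Drop lag values that never occur for a given user_type, then sort
a <- a[a$percentage > 0, ]
a <- a[order(a$user_type, a$lag), ]
rownames(a) <- NULL
```

The resulting a has one row per (user_type, lag) pair that actually occurs. It could then be plotted directly, for instance with ggplot(a, aes(x = factor(lag), y = percentage, fill = user_type)) + geom_bar(stat = "identity", position = "dodge").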
Upvotes: 3