FiofanS

Reputation: 309

How to calculate average time for aggregated data per different groups?

I have the following data frame (this question relates to [this thread]):

df = data.frame(c("2012","2012","2012","2013"),
                c("AAA","BBB","AAA","AAA"),
                c("X","Not-serviced","X","Y"),
                c("2","10","3","2.5"))

colnames(df) = c("year","type","service_type","waiting_time")

I want to get average waiting times for serviced and nonserviced groups. This is how the data is grouped:

library(data.table)
setDT(df)[, .(num_serviced = sum(service_type != "Not-serviced"), 
      num_notserviced = sum(service_type == "Not-serviced"),
      avg_wt = mean(waiting_time)), ## THE PROBLEM HERE!!!
     .(year, type)][, Total := num_serviced + num_notserviced][]

However, avg_wt = mean(waiting_time) computes the average waiting time over all rows of each group (i.e. over Total). What I need instead are two separate columns, avg_wt_serviced and avg_wt_notserviced.

The result must be:

year  type num_serviced num_notserviced num_total avg_wt_serviced  avg_wt_notserviced
2012  AAA  2            0               2         2.5              0

Upvotes: 1

Views: 68

Answers (3)

timat

Reputation: 1500

Here it is. In your data frame, waiting_time must be numeric for mean() to work; see as.numeric() to convert it.

df = data.frame(c("2012","2012","2012","2013"),
                c("AAA","BBB","AAA","AAA"),
                c("X","Not-serviced","X","Y"),
                c(2,10,3,2.5))

colnames(df) = c("year","type","service_type","waiting_time")

library(data.table)
setDT(df)[, .(num_serviced = sum(service_type != "Not-serviced"),
              num_notserviced = sum(service_type == "Not-serviced"),
              # mean over the matching rows only; 0 when the group has no such rows
              avg_wt_serviced = if (any(service_type != "Not-serviced"))
                  mean(waiting_time[service_type != "Not-serviced"]) else 0,
              avg_wt_notserviced = if (any(service_type == "Not-serviced"))
                  mean(waiting_time[service_type == "Not-serviced"]) else 0),
          .(year, type)][, Total := num_serviced + num_notserviced][]

Upvotes: 2

Ronak Shah

Reputation: 388817

With dplyr, we can subset waiting_time by service_type inside mean() (this assumes waiting_time is numeric):

library(dplyr)
df %>%
   group_by(year,type) %>%
   summarise(num_serviced = sum(service_type != "Not-serviced"), 
             num_notserviced = sum(service_type == "Not-serviced"),
             num_total = num_serviced + num_notserviced, 
             avg_wt_serv = mean(waiting_time[service_type != "Not-serviced"]),
             avg_wt_notser = mean(waiting_time[service_type == "Not-serviced"]))


#   year  type num_serviced num_notserviced num_total avg_wt_serv  avg_wt_notser
#   <fctr> <fctr>   <int>           <int>     <int>      <dbl>         <dbl>
#1   2012    AAA       2               0         2        2.5            NaN
#2   2012    BBB       0               1         1        NaN            10
#3   2013    AAA       1               0         1        2.5            NaN
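The expected output in the question shows 0 rather than NaN where a group has no matching rows. A minimal sketch of one way to do that replacement afterwards, in base R. The data frame `res` is a hypothetical stand-in holding the values printed above, not the actual summarise() result:

```r
# Hypothetical stand-in for the summarised result (values copied from the output above)
res <- data.frame(
  year = c("2012", "2012", "2013"),
  type = c("AAA", "BBB", "AAA"),
  avg_wt_serv = c(2.5, NaN, 2.5),
  avg_wt_notser = c(NaN, 10, NaN)
)

# Columns that hold the group means
avg_cols <- c("avg_wt_serv", "avg_wt_notser")

# mean() of an empty subset is NaN; replace those entries with 0
res[avg_cols] <- lapply(res[avg_cols], function(x) replace(x, is.nan(x), 0))

print(res)
```

Whether 0 is the right value for "no observations" depends on how the result is used downstream; NaN (or NA) keeps the distinction between "no rows" and "zero waiting time".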

Upvotes: 2

R.S.

Reputation: 2140

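The problem seems to lie in the quoted values. Edit/Added: because of the quotes, the column is read as a factor variable (in R versions before 4.0.0; from R 4.0.0 on, data.frame() keeps it as character by default). Either way it is not numeric; see class(df$waiting_time).

Adding this line before the computation gives the correct answer for me.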

df$waiting_time<- as.numeric(as.character(df$waiting_time))
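To illustrate why the as.character() step matters when the column is a factor, here is a small base R sketch. The vector `wt` is made up for the illustration and reuses the waiting times from the question:

```r
# A factor built from numeric-looking strings, as data.frame() produced before R 4.0.0
wt <- factor(c("2", "10", "3", "2.5"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(wt)

# Going through as.character() first recovers the actual numbers
values <- as.numeric(as.character(wt))

print(codes)         # level codes, not waiting times
print(values)        # 2.0 10.0 3.0 2.5
print(mean(values))  # 4.375
```

If the column is already character (the default from R 4.0.0 on), as.numeric() alone would be enough, and the extra as.character() call is harmless.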

Upvotes: 0
