Multilevel summarize with dplyr in R

Question

I have a set of data like this

project,file,line,mutant,killstatus,commitid,isbug
p1,f1,100,NMC,killed,abc123,yes
p1,f1,100,VMC,alive,abc123,yes
p1,f1,200,NVM,alive,ab12de,no
p1,f1,200,NVM,alive,abcde1,yes
p1,f1,200,NVM,alive,abcde2,yes

I would like to group by project,file,line and summarize it as

project,file,line,total.mutants,killed.mutants,total.commits,bugfix.commits
p1,f1,100,2,1,1,1
p1,f1,200,1,0,3,2

That is, for each project,file,line combination, the length of total.mutants is the unique count of (mutant,killstatus), and killed.mutants is the count of killed in the previous pair. similarly, total.commits is the unique count of (commitid,isbug) and bugfix.commits is the count of yes in the previous pair.

To clarify: For p1,f1,100 there are two unique mutant,killstatus pairs (NMC,killed),(VMC,alive) of which one is killed. Hence 2,1 for total, killed For p1,f1,200 there is only one unique mutant,killstatus pair (NVM,alive) which is alive. Hence 1,0 for total,killed

For p1,f1,100 there is one unique commitid,isbug pair (abc123,yes) which is a bug. Hence 1,1 for total,bugfix For p1,f1,200 there are three unique commitid,isbug pairs (ab12de,no),(abcde1,yes),(abcde2,yes) of which two are bugs. Hence 3,2 for total,bugfix

Can I use dplyr to summarize it in one go? I managed to do the first partially like

data %>% group_by(project,file,line) %>% summarize(
   total.mutants = length(killstatus),
   killed.mutants = sum(ifelse(killstatus == 'alive', F, T))
)

but I don't know how to do the unique part, and second part.

akrun · Accepted Answer

We can try

library(dplyr)
library(data.table)
df1 %>%
 group_by(project,file,line) %>% 
  transmute(temp1 = paste(mutant, killstatus),

          total.mutants= uniqueN(temp1),
          killed.mutants= uniqueN(paste(mutant[killstatus=='killed'],
                            killstatus[killstatus=='killed'])),
         temp2=paste(commitid, isbug), 
         total.commits= uniqueN(temp2), 
         bug.commits= uniqueN(paste(commitid[isbug=="yes"], 
                            isbug[isbug=="yes"])) ) %>%
         distinct(., project, file, line) %>%
         select(-temp1, -temp2) 
#     project  file  line total.mutants killed.mutants total.commits bug.commits
#    (chr) (chr) (int)         (int)          (int)         (int)       (int)
#1      p1    f1   100             2              1             1           1
#2      p1    f1   200             1              0             3           2

Or a slightly more compact version with unite from tidyr

library(tidyr)
df1 %>% 
    unite(temp1,mutant, killstatus) %>%
    unite(temp2, commitid, isbug) %>%
    group_by(project, file, line) %>% 
    summarise(total.mutants= n_distinct(temp1), 
               killed.mutants= uniqueN(temp1[grepl("killed", temp1)]),
               total.commits= n_distinct(temp2),
               bug.commits= uniqueN(temp2[grepl("yes", temp2)]))
#     project  file  line total.mutants killed.mutants total.commits bug.commits
#     (chr) (chr) (int)         (int)          (int)         (int)       (int)
#1      p1    f1   100             2              1             1           1
#2      p1    f1   200             1              0             3           2

Multilevel summarize with dplyr in R

Answers (1)

Related Questions