Reputation: 828
I have a set of data like this
project,file,line,mutant,killstatus,commitid,isbug
p1,f1,100,NMC,killed,abc123,yes
p1,f1,100,VMC,alive,abc123,yes
p1,f1,200,NVM,alive,ab12de,no
p1,f1,200,NVM,alive,abcde1,yes
p1,f1,200,NVM,alive,abcde2,yes
I would like to group by project,file,line and summarize it as
project,file,line,total.mutants,killed.mutants,total.commits,bugfix.commits
p1,f1,100,2,1,1,1
p1,f1,200,1,0,3,2
That is, for each project,file,line
combination, the length of total.mutants
is the unique count of (mutant,killstatus)
, and killed.mutants
is the count of killed
in the previous pair. similarly, total.commits
is the unique count of (commitid,isbug)
and bugfix.commits
is the count of yes
in the previous pair.
To clarify:
For p1,f1,100
there are two unique mutant,killstatus pairs (NMC,killed),(VMC,alive)
of which one is killed. Hence 2,1
for total, killed
For p1,f1,200
there is only one unique mutant,killstatus pair (NVM,alive)
which is alive. Hence 1,0
for total,killed
For p1,f1,100
there is one unique commitid,isbug pair (abc123,yes)
which is a bug. Hence 1,1
for total,bugfix
For p1,f1,200
there are three unique commitid,isbug pairs (ab12de,no),(abcde1,yes),(abcde2,yes)
of which two are bugs. Hence 3,2
for total,bugfix
Can I use dplyr to summarize it in one go? I managed to do the first partially like
data %>% group_by(project,file,line) %>% summarize(
total.mutants = length(killstatus),
killed.mutants = sum(ifelse(killstatus == 'alive', F, T))
)
but I don't know how to do the unique part, and second part.
Upvotes: 1
Views: 134
Reputation: 886938
We can try
library(dplyr)
library(data.table)
df1 %>%
group_by(project,file,line) %>%
transmute(temp1 = paste(mutant, killstatus),
total.mutants= uniqueN(temp1),
killed.mutants= uniqueN(paste(mutant[killstatus=='killed'],
killstatus[killstatus=='killed'])),
temp2=paste(commitid, isbug),
total.commits= uniqueN(temp2),
bug.commits= uniqueN(paste(commitid[isbug=="yes"],
isbug[isbug=="yes"])) ) %>%
distinct(., project, file, line) %>%
select(-temp1, -temp2)
# project file line total.mutants killed.mutants total.commits bug.commits
# (chr) (chr) (int) (int) (int) (int) (int)
#1 p1 f1 100 2 1 1 1
#2 p1 f1 200 1 0 3 2
Or a slightly more compact version with unite
from tidyr
library(tidyr)
df1 %>%
unite(temp1,mutant, killstatus) %>%
unite(temp2, commitid, isbug) %>%
group_by(project, file, line) %>%
summarise(total.mutants= n_distinct(temp1),
killed.mutants= uniqueN(temp1[grepl("killed", temp1)]),
total.commits= n_distinct(temp2),
bug.commits= uniqueN(temp2[grepl("yes", temp2)]))
# project file line total.mutants killed.mutants total.commits bug.commits
# (chr) (chr) (int) (int) (int) (int) (int)
#1 p1 f1 100 2 1 1 1
#2 p1 f1 200 1 0 3 2
Upvotes: 1