Reputation: 3195
Here is a sample of my data:
kod <- structure(list(ID_WORKES = c(28029571L, 28029571L, 28029571L,
28029571L, 28029571L, 28029571L, 28029571L, 28029571L, 28029571L
), TABL_NOM = c(9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 9716L,
9716L, 9716L), NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "Dim", class = "factor"), ID_SP_NAR = c(20L,
20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L), KOD_DOR = c(28L, 28L,
28L, 28L, 28L, 28L, 28L, 28L, 28L), KOD_DEPO = c(9167L, 9167L,
9167L, 9167L, 9167L, 9167L, 9167L, 9167L, 9167L), COLUMN_MASH = c(13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), prop_violations = c(0.00561797752808989,
0.00293255131964809, 0.00495049504950495, 0.00215982721382289,
0.0120481927710843, 0.00561797752808989, 0.00293255131964809,
0.00591715976331361, 0.00495049504950495), mash_score = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, -9L), class = "data.frame")
W
hat I would like to achieve is the following:
for each group formed by the columns ID_WORKES
, TABL_NOM
, NAME
, KOD_DOR
and KOD_DEPO
, I would like to have a unique value in ID_SP_NAR
.
For example here we have six rows where ID_SP_NAR == 30
with different values for prop_violations
.
In this case I would like to summarise these six rows such that the remaining value for prop_violations
equals the mean for these six rows.
The desired output then would look like:
ID_WORKES TABL_NOM NAME KOD_DOR KOD_DEPO ID_SP_NAR prop_violations mash_score
1 28029571 9716 Dim 28 9167 20 0.004500341 0
2 28029571 9716 Dim 28 9167 30 0.005604367 0
But there is one more thing: if for some duplicate values in prop_violations for ID_SP_NAR, mash_ score has value >0 then left last value for which mash_score has value>0
for example.
ID_WORKES TABL_NOM NAME ID_SP_NAR KOD_DOR KOD_DEPO COLUMN_MASH prop_violations mash_score
1 28029571 9716 Dim 30 28 9167 13 0,0056 0
2 28029571 9716 Dim 30 28 9167 13 0,012048193 0
3 28029571 9716 Dim 30 28 9167 13 0,005617978 0
4 28029571 9716 Dim 30 28 9167 13 0,002932551 1
5 28029571 9716 Dim 30 28 9167 13 0,00591716 0
6 28029571 9716 Dim 30 28 9167 13 0,004950495 0
in such case will left only value 0,002932551 by prop_violation for ID_SP_NAR=30,cause mash_score>0 How reach this conditions?
Upvotes: 1
Views: 77
Reputation: 25225
An option using data.table
:
setDT(kod)
kod[, {
if(any(mash_score)>0) {
i <- which(mash_score>0)[1L]
.(prop_violations=prop_violations[i], mash_score=mash_score[i])
} else
.(prop_violations=mean(prop_violations), mash_score=mash_score[1L])
},
.(ID_WORKES, TABL_NOM, NAME, KOD_DOR, KOD_DEPO, ID_SP_NAR)]
output:
ID_WORKES TABL_NOM NAME KOD_DOR KOD_DEPO ID_SP_NAR prop_violations mash_score
1: 28029571 9716 Dim 28 9167 20 0.004500341 0
2: 28029571 9716 Dim 28 9167 30 0.002932551 1
data:
kod <- structure(list(ID_WORKES = c(28029571L, 28029571L, 28029571L,
28029571L, 28029571L, 28029571L, 28029571L, 28029571L, 28029571L
), TABL_NOM = c(9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 9716L,
9716L, 9716L), NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "Dim", class = "factor"), ID_SP_NAR = c(20L,
20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L), KOD_DOR = c(28L, 28L,
28L, 28L, 28L, 28L, 28L, 28L, 28L), KOD_DEPO = c(9167L, 9167L,
9167L, 9167L, 9167L, 9167L, 9167L, 9167L, 9167L), COLUMN_MASH = c(13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), prop_violations = c(0.00561797752808989,
0.00293255131964809, 0.00495049504950495, 0.00215982721382289,
0.0120481927710843, 0.00561797752808989, 0.00293255131964809,
0.00591715976331361, 0.00495049504950495), mash_score = c(0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)), row.names = c(NA, -9L), class = "data.frame")
Upvotes: 3
Reputation: 11981
here is a solution using the tidyverse
package:
kod %>%
group_by(ID_WORKES, TABL_NOM, NAME, KOD_DOR, KOD_DEPO, ID_SP_NAR) %>%
summarise(prop_violations = if (all(mash_score == 0)) mean(prop_violations) else last(prop_violations[mash_score > 0]))
If for a specific group all mash_score
are equal to zero the average is returned (using mean
). If at least one mash_score
is larger than zero then the last value of prop_violations
for which mash_score > 0
is returned (using dplyr::last
).
Upvotes: 2