Reputation: 3
Suppose I have following dataset:
id <- c("vp01","vp01","vp01","vp01", "vp02", "vp02","vp02","vp02","vp02", "vp02")
rating <- c("0","1","0","0","1","0","1", "0", "0", "0")
Au1 <- c("150.0","100.45","80.23","133.21","94.33","102.22", "83.45", "122.65", "115.41", "109.34")
df <- data.frame(id,rating,Au1)
id rating Au1
1 vp01 0 150.0
2 vp01 1 100.45
3 vp01 0 80.23
4 vp01 0 133.21
5 vp02 1 94.33
6 vp02 0 102.22
7 vp02 1 83.45
8 vp02 0 122.65
9 vp02 0 115.41
10 vp02 0 109.34
In my dataset, there are many more 0 than 1. I would like to remove the rows in a way that within each id the number of 0 ratings matches the number of 1 ratings. The number of rows per id can differ. The result could look like this.
id rating Au1
1 vp01 0 150.0
2 vp01 1 100.45
5 vp02 1 94.33
6 vp02 0 102.22
7 vp02 1 83.45
8 vp02 0 122.65
I have absolutely no idea how to tackle the problem, so any help is very much appreciated!
Upvotes: 0
Views: 87
Reputation: 389175
Add a count column for each id
and rating
, then for each id
select the count value which is minimum and for each id
and rating
select that minimum number of rows.
library(dplyr)
df %>%
add_count(id, rating) %>%
group_by(id) %>%
mutate(n = min(n)) %>%
group_by(rating, .add = TRUE) %>%
#In old dplyr add = TRUE
#group_by(rating, add = TRUE) %>%
sample_n(n) %>%
select(-n)
# id rating Au1
# <chr> <chr> <chr>
#1 vp01 0 133.21
#2 vp01 1 100.45
#3 vp02 0 102.22
#4 vp02 0 115.41
#5 vp02 1 83.45
#6 vp02 1 94.33
I am using sample_n
here to select any random n
rows for id
and rating
. This can be changed if you have any preferences. For example, we can select first n
rows, last n
rows, or rows based on value of Au1
column.
Upvotes: 3
Reputation: 5798
Base R solution:
# Create a grouping vector to split on in order to remove duplicates:
# group => character vector
df$group <- paste0(df$id, unlist(sapply(split(df, df$id), function(y){
cumsum(c(TRUE, abs(diff(as.numeric(y$rating)))))
}
)
)
)
# Remove data duplicated within groups: data.frame => stdout (console)
data.frame(do.call("rbind", lapply(split(df, df$group), function(x){
x[!duplicated(x$group), c("id", "rating", "Au1")]
}
)
),
row.names = NULL)
Output:
# id rating Au1
#1 vp01 0 150.0
#2 vp01 1 100.45
#3 vp01 0 80.23
#4 vp02 1 94.33
#5 vp02 0 102.22
#6 vp02 1 83.45
#7 vp02 0 122.65
Upvotes: 0