Papaeya
Papaeya

Reputation: 3

reduce dataframe to get equal amounts of rows per value (by group)

Suppose I have following dataset:

id <- c("vp01","vp01","vp01","vp01", "vp02", "vp02","vp02","vp02","vp02", "vp02")
rating <- c("0","1","0","0","1","0","1", "0", "0", "0")
Au1 <- c("150.0","100.45","80.23","133.21","94.33","102.22", "83.45", "122.65", "115.41", "109.34")

df <- data.frame(id,rating,Au1)


     id rating    Au1
1  vp01      0  150.0
2  vp01      1 100.45
3  vp01      0  80.23
4  vp01      0 133.21
5  vp02      1  94.33
6  vp02      0 102.22
7  vp02      1  83.45
8  vp02      0 122.65
9  vp02      0 115.41
10 vp02      0 109.34

In my dataset, there are many more 0 than 1. I would like to remove the rows in a way that within each id the number of 0 ratings matches the number of 1 ratings. The number of rows per id can differ. The result could look like this.

     id rating    Au1
1  vp01      0  150.0
2  vp01      1 100.45
5  vp02      1  94.33
6  vp02      0 102.22
7  vp02      1  83.45
8  vp02      0 122.65

I have absolutely no idea how to tackle the problem, so any help is very much appreciated!

Upvotes: 0

Views: 87

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 389175

Add a count column for each id and rating, then for each id select the count value which is minimum and for each id and rating select that minimum number of rows.

library(dplyr)

df %>%
  add_count(id, rating) %>%
  group_by(id) %>%
  mutate(n = min(n)) %>%
  group_by(rating, .add = TRUE) %>%
  #In old dplyr add = TRUE
  #group_by(rating, add = TRUE) %>%
  sample_n(n) %>%
  select(-n)


#  id    rating Au1   
#  <chr> <chr>  <chr> 
#1 vp01  0      133.21
#2 vp01  1      100.45
#3 vp02  0      102.22
#4 vp02  0      115.41
#5 vp02  1      83.45 
#6 vp02  1      94.33 

I am using sample_n here to select any random n rows for id and rating. This can be changed if you have any preferences. For example, we can select first n rows, last n rows, or rows based on value of Au1 column.

Upvotes: 3

hello_friend
hello_friend

Reputation: 5798

Base R solution:

# Create a grouping vector to split on in order to remove duplicates:
# group => character vector
df$group <- paste0(df$id, unlist(sapply(split(df, df$id), function(y){
        cumsum(c(TRUE, abs(diff(as.numeric(y$rating)))))
      }
    )
  )
)

# Remove data duplicated within groups: data.frame => stdout (console)
data.frame(do.call("rbind", lapply(split(df, df$group), function(x){
        x[!duplicated(x$group), c("id", "rating", "Au1")]
      }
    )
  ),
row.names = NULL)

Output:

#    id rating    Au1
#1 vp01      0  150.0
#2 vp01      1 100.45
#3 vp01      0  80.23
#4 vp02      1  94.33
#5 vp02      0 102.22
#6 vp02      1  83.45
#7 vp02      0 122.65

Upvotes: 0

Related Questions