Lee

Reputation: 369

Random sampling one row within each id

I have data like this:

data<-data.frame(id=c(1,1,1,1,2,2,2,3,3,3,4,4,4),
                 yearmonthweek=c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,2012052,
                                 2012053,2012054,2012071,2012073,2012074),
                 event=c(0,1,1,0,0,1,0,0,0,0,0,0,0),
                 a=c(11,12,13,10,11,12,15,14,13,15,19,10,20))

id is a personal identifier, and yearmonthweek encodes the year, month and week. I want to clean the data according to the following rules. First, find the ids that have at least one event. Here, id = 1 and 2 have events, while id = 3 and 4 have none. Second, pick one random row for each id that has events, and one random row for each id that has no events. The number of rows in the result should therefore equal the number of ids. My expected output looks like this:

data<-data.frame(id=c(1,2,3,4),
                 yearmonthweek=c(2012053,2013052,2012052,2012073),
                 event=c(1,1,0,0),
                 a=c(12,12,14,10))

Because the rows are sampled at random, the values may differ from those above, but the result should have 4 rows, one per id.
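For reference, the rules could be written down as a small base R check that any candidate result can be run through. This is only a sketch: the names input, expected and check_sample are made up here, and it assumes, as the expected output suggests, that an id with events should be represented by one of its event rows.

```r
# Input and expected output copied from the question (renamed so they do not clash)
input <- data.frame(id = c(1,1,1,1,2,2,2,3,3,3,4,4,4),
                    yearmonthweek = c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,
                                      2012052,2012053,2012054,2012071,2012073,2012074),
                    event = c(0,1,1,0,0,1,0,0,0,0,0,0,0),
                    a = c(11,12,13,10,11,12,15,14,13,15,19,10,20))
expected <- data.frame(id = c(1,2,3,4),
                       yearmonthweek = c(2012053,2013052,2012052,2012073),
                       event = c(1,1,0,0),
                       a = c(12,12,14,10))

# Hypothetical helper: TRUE if `result` is a valid one-row-per-id sample of `input`
check_sample <- function(result, input) {
  # Does each id have at least one event? (named by id)
  has_event <- tapply(input$event, input$id, function(e) any(e == 1))
  # Exactly one row per id, covering every id in the input
  one_per_id <- !anyDuplicated(result$id) && setequal(result$id, input$id)
  # Every sampled row must be an actual row of the input
  key <- function(d) paste(d$id, d$yearmonthweek, d$event, d$a)
  from_input <- all(key(result) %in% key(input))
  # Assumption: ids with events get an event row, ids without get a non-event row
  event_ok <- all((result$event == 1) == has_event[as.character(result$id)])
  one_per_id && from_input && event_ok
}

check_sample(expected, input)
```

The same call can be applied to the output of any answer, after converting it to a plain data frame if needed.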

Upvotes: 0

Views: 192

Answers (3)

jblood94

Reputation: 16981

With data.table:

library(data.table)

set.seed(1)
setorder(
  unique(
    setorder(
      setDT(data)[
        , idx := .I # add an index column to re-sort later
      ][
        sample(nrow(data)) # randomize the table
      ],
      -event # sort descending by event
    ),
    by = "id" # get unique rows by id
  ),
  idx # re-sort
)[, idx := NULL][] # remove the index column
#>    id yearmonthweek event  a
#> 1:  1       2012053     1 12
#> 2:  2       2013052     1 12
#> 3:  3       2012053     0 13
#> 4:  4       2012071     0 19
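The shuffle, sort and deduplicate idea in the code above can also be sketched in base R, which may make the individual steps easier to follow. This is an illustrative rewrite, not the data.table code itself; the intermediate names shuffled, sorted and picked are made up, and the random draws will not necessarily match the output shown above.

```r
# Example data copied from the question
data <- data.frame(id = c(1,1,1,1,2,2,2,3,3,3,4,4,4),
                   yearmonthweek = c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,
                                     2012052,2012053,2012054,2012071,2012073,2012074),
                   event = c(0,1,1,0,0,1,0,0,0,0,0,0,0),
                   a = c(11,12,13,10,11,12,15,14,13,15,19,10,20))

set.seed(1)

# 1. Put the rows in random order
shuffled <- data[sample(nrow(data)), ]

# 2. Move event rows to the top; order() keeps tied rows in their shuffled order
sorted <- shuffled[order(-shuffled$event), ]

# 3. Keep the first row seen for each id: an event row if the id has one,
#    otherwise a random non-event row
picked <- sorted[!duplicated(sorted$id), ]

# 4. Restore a readable order and drop the shuffled row names
picked <- picked[order(picked$id), ]
rownames(picked) <- NULL
picked
```

Because the shuffle happens before the sort, the first row per id is a uniform draw among that id's event rows (or among all of its rows when it has no events).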

Upvotes: 0

Maurits Evers

Reputation: 50678

Here is an option

library(dplyr)

set.seed(2022)
data %>%
    group_by(id) %>%
    mutate(has_event = any(event == 1)) %>%
    filter(if_else(has_event, event == 1, event == 0)) %>%
    slice_sample(n = 1) %>%
    select(-has_event) %>%
    ungroup()
## A tibble: 4 × 4
#     id yearmonthweek event     a
#  <dbl>         <dbl> <dbl> <dbl>
#1     1       2012061     1    13
#2     2       2013052     1    12
#3     3       2012053     0    13
#4     4       2012074     0    20

Explanation: Group by id and flag whether each group has at least one event. If it does, keep only the rows where event == 1; otherwise keep all of the group's rows (all of which have event == 0). Then use slice_sample to select a single row per group uniformly at random.
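The steps in the explanation can also be written without dplyr. The following is a minimal base R sketch of the same logic, not a replacement for the pipeline above; pick_one is a hypothetical helper name, and the rows drawn will generally differ from the tibble shown even with the same seed.

```r
# Example data copied from the question
data <- data.frame(id = c(1,1,1,1,2,2,2,3,3,3,4,4,4),
                   yearmonthweek = c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,
                                     2012052,2012053,2012054,2012071,2012073,2012074),
                   event = c(0,1,1,0,0,1,0,0,0,0,0,0,0),
                   a = c(11,12,13,10,11,12,15,14,13,15,19,10,20))

# Hypothetical helper: draw one row from the rows of a single id
pick_one <- function(d) {
  # Candidate rows: the event rows if the id has any event, otherwise all rows
  cand <- if (any(d$event == 1)) which(d$event == 1) else seq_len(nrow(d))
  # Sample a position with sample.int() rather than calling sample(cand, 1):
  # sample() treats a single number n as 1:n, which would pick the wrong row
  d[cand[sample.int(length(cand), 1)], , drop = FALSE]
}

set.seed(2022)
# Split by id, draw one row per piece, and stack the pieces again
result <- do.call(rbind, lapply(split(data, data$id), pick_one))
rownames(result) <- NULL
result
```

Sampling a position in cand, instead of sampling cand directly, matters for ids such as id = 2 that have exactly one event row.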

Upvotes: 1

Rui Barradas

Reputation: 76402

Here is a dplyr way in two steps.

data <- data.frame(id=c(1,1,1,1,2,2,2,3,3,3,4,4,4),
                 yearmonthweek=c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,2012052,
                                 2012053,2012054,2012071,2012073,2012074),
                 event=c(0,1,1,0,0,1,0,0,0,0,0,0,0),
                 a=c(11,12,13,10,11,12,15,14,13,15,19,10,20))

suppressPackageStartupMessages(
  library(dplyr)
)

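bind_rows(
  # ids with at least one event: sample one of their event rows
  data %>%
    filter(event != 0) %>%
    group_by(id) %>%
    sample_n(size = 1),
  # ids with no events: sample one of their rows
  data %>%
    group_by(id) %>%
    mutate(event = !all(event == 0)) %>% # TRUE if the id has any event
    filter(!event) %>%
    sample_n(size = 1)
)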
#> # A tibble: 4 × 4
#> # Groups:   id [4]
#>      id yearmonthweek event     a
#>   <dbl>         <dbl> <dbl> <dbl>
#> 1     1       2012061     1    13
#> 2     2       2013052     1    12
#> 3     3       2012054     0    15
#> 4     4       2012071     0    19

Created on 2022-10-21 with reprex v2.0.2

Upvotes: 1
