Reputation: 2575

sampling based on specified column values in R

I have a data like this, where Average is the average of X, Y, and Z.

head(df)
ID  X   Y   Z   Average
A   2   2   5   3
A   4   3   2   3
A   4   3   2   3
B   5   3   1   3
B   3   4   2   3
B   1   5   3   3
C   5   3   1   3
C   2   3   4   3
C   5   3   1   3
D   2   3   4   3
D   3   2   4   3
D   3   2   4   3
E   5   3   1   3
E   4   3   2   3
E   3   4   2   3

To reproduce this, we can use

df <- data.frame(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E"),
                     X = c(2L, 4L, 4L, 5L, 3L,1L, 5L, 2L, 5L, 2L, 3L, 3L, 5L, 4L, 3L),
                     Y = c(2L, 3L, 3L, 3L,4L, 5L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 4L), 
                     Z = c(5L, 2L, 2L,1L, 2L, 3L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 2L, 2L), 
                     Average = c(3L,3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L))

From this, I want to extract one observation per ID such that we don't get same (as much as is possible) values of the combination of X, Y, and Z. I tried

library(dplyr)
df %>% sample_n(size = nrow(.), replace = FALSE) %>% distinct(ID, .keep_all = T)

But, on a larger dataset, I see too many repetitions of the combination of X, Y, Z. To the extent possible, I need the output with equal or close to equal representation of cases (i.e. the combination of X, Y, Y) like this:

   ID   X   Y   Z   Average
    A   2   2   5   3
    B   5   3   1   3
    C   2   3   4   3
    D   3   2   4   3
    E   4   3   2   3

Upvotes: 1

Answers (4)

Josh Pachner

Reputation: 521

I just want to add to the answer from lefft. I modified the code to allow sample sizes and not dependent on having a column called ID, or X,Y,Z, or any of that

id_sample <- function(df, id, field, sampleSize) {
  df = df %>%
   filter(!!as.symbol(field) == id)
  return(df[sample(1:nrow(df), size = sampleSize, replace = TRUE),])
}

  make_sample_df <- function(dat, sampleSize, field) {
   # initialize empty data frame to fill with row samples
  out <- dat[NULL,]
  # get the unique id's to loop over
  ids <- unique(dat[[field]])
   for (id in ids) {
    # grab a sample
    id_row <- id_sample(dat, id, field,sampleSize)
   out <- rbind(out, id_row)
     }
  return(out)
  }

And then how I use this is

   sample_df = make_sample_df(df, 20, "ColumnToSampleOn")

where df is my original dataframe, 20 is the count of rows I want for each unique value found in the "ColumnToSampleOn"

Upvotes: 1

Gregor Thomas

Reputation: 146120

This seems dubious, but try this:

library(dplyr)
df %>% add_count(X, Y, Z) %>%
    group_by(ID) %>%
    top_n(-1, n) %>%
    arrange(runif(n)) %>%
    select(-n) %>%
    slice(1)
# # A tibble: 5 x 5
# # Groups:   ID [5]
#       ID     X     Y     Z Average
#   <fctr> <int> <int> <int>   <int>
# 1      A     2     2     5       3
# 2      B     1     5     3       3
# 3      C     2     3     4       3
# 4      D     3     2     4       3
# 5      E     3     4     2       3

This picks the least common XYZ combo from each ID, and in case of a tie chooses randomly. Extremely common XYZ combos may be missing entirely.

Upvotes: 2

lefft

Reputation: 2105

Here's one way to do it. Explanations inline. Note that the implementation can be tweaked depending on the desired notion of "similarity"/"duplication" between rows.

# get the sample data from the original post
dat <- data.frame(
  ID = c("A","A","A","B", "B", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E"),
  X = c(2L, 4L, 4L, 5L, 3L,1L, 5L, 2L, 5L, 2L, 3L, 3L, 5L, 4L, 3L),
  Y = c(2L, 3L, 3L, 3L,4L, 5L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 4L),
  Z = c(5L, 2L, 2L,1L, 2L, 3L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 2L, 2L),
  Average = c(3L,3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L))

Define a function that samples a row of a data frame given an id (assumes $ID column exists):

# function to get a randomly sampled row from `df` with `df$ID==id`
id_sample <- function(df, id){
  df <- df[df$ID==id, ]
  return(df[sample(1:nrow(df), size=1), ])
}

Define a function that will loop over each id, rejecting a sample if it is "too similar" to the rows already in the data frame:

make_sample_df <- function(dat, threshold){

  # initialize empty data frame to fill with row samples
  out <- dat[NULL, ]

  # get the unique id's to loop over
  ids <- unique(dat$ID)

  for (id in ids){

    # grab a sample
    id_row <- id_sample(dat, id)

    # see how many of its elements have column-duplicates (zero for first id)
    n_dupes <- sum(apply(out, 1, function(row){
      sum(row[1]==id_row$X, row[2]==id_row$Y, row[3]==id_row$Z)}))

    # keep getting samps if the number of duplicates is higher than threshold
    while (n_dupes > threshold){

      id_row <- id_sample(dat, id)

      n_dupes <- sum(apply(out, 1, function(row){
        sum(row[1]==id_row$X, row[2]==id_row$Y, row[3]==id_row$Z)}))
    }

    # once we have a suitable row for `id`, add it to the output df
    out <- rbind(out, id_row)
  }

  return(out)
}

Now apply the func to OP's data and inspect:

# rows at most 1 of whose values appear in another row (at same column)     
set.seed(6933)
make_sample_df(dat, threshold=1)

## ID X Y Z Average
## A  4 3 2       3
## B  1 5 3       3
## C  2 3 4       3
## D  3 2 4       3
## E  5 3 1       3

Depending on your needs, you can try different threshold values, though note that if you're too strict about threshold, the while loop could keep running forever, so you might want to put some kind of escape hatch in there.

You can also adapt this strategy to different variations, e.g. where you care about values getting duplicated within rows, instead of columns.

Hope this helps ~~

Upvotes: 1

MKR

Reputation: 20095

One possible solution could be by excluding values which has been sampled for previous column.

library(dplyr)
df %>% group_by(ID) %>%
  summarise(XX = sample(unique(X),1),
            YY = sample(unique(Y[Y!=XX]),1),
            ZZ = sample(unique(Z[Z!=XX & Z!=YY]),1),
            Average = mean(Average))
# # A tibble: 5 x 5
#    ID        XX    YY    ZZ Average
#   <fctr> <int> <int> <int>   <dbl>
# 1 A          4     2     1    3.00
# 2 B          5     4     2    3.00
# 3 C          5     3     1    3.00
# 4 D          2     3     2    3.00
# 5 E          5     4     2    3.00

I have changed column names just make the logic clear.

Upvotes: 0

sampling based on specified column values in R

Answers (4)

Related Questions