Randomly sample rows of data frame with given weights (frequency)

Question

I have a data frame in the following format: one column with raw sequences, another column with the number of times a sequence occurs, and other columns with other characteristics.

c1 <- c(324, 213, 122, 34)
c2 <- c("SDOIHHFOEKN", "SDIUFONBSD", "DSLIHFEIHDFS", "DOOIUDBD")
c3 <- c("G", "T", "U", "T")

df <- data.frame(count = c1, seq = c2, other = c3)

My actual data frame has over 10^6 rows and 20 columns.

I want to randomly sample N sequences from this, while maintaining the data frame structure as above. For example, I want to randomly sample 300 sequences from the above data frame. In expectation, the relative proportions of the four unique sequences present here should be retained in the final data frame.

How can I do this random sampling? I was thinking of using reshape::untable to expand the data frame and then using a random number generator and grep to pick the rows, but then I cannot get the result back into the initial format, with each row holding a unique sequence and the count of how many times that sequence shows up.

Thanks!

thc · Accepted Answer

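Use sample.int for speed. Draw row indices weighted by count, then tally how often each sequence was drawn:

library(dplyr)  # provides %>%, group_by() and summarize()

sampled_df <- df[sample.int(nrow(df), 300, replace = TRUE, prob = df$count), ] %>%
  group_by(seq) %>%
  summarize(count = n(), other = unique(other)) %>%
  as.data.frame()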

> sampled_df
           seq count other
1     DOOIUDBD    21     T
2 DSLIHFEIHDFS    53     U
3   SDIUFONBSD   102     T
4  SDOIHHFOEKN   124     G
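
If the real data frame has many extra columns, listing each of them inside summarize() gets tedious. A possible base R alternative, offered only as a sketch under the same assumption as above (sampling with replacement, probability proportional to count), is to draw the per-row counts directly with rmultinom and overwrite the count column. The set.seed call and the variable names n and draws are my own additions for illustration, not part of the original answer.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Example data from the question
df <- data.frame(
  count = c(324, 213, 122, 34),
  seq   = c("SDOIHHFOEKN", "SDIUFONBSD", "DSLIHFEIHDFS", "DOOIUDBD"),
  other = c("G", "T", "U", "T")
)

n <- 300  # number of sequences to sample

# One multinomial draw: how many of the n samples fall on each row,
# with probability proportional to the original count
# (rmultinom normalizes prob internally).
draws <- as.vector(rmultinom(1, size = n, prob = df$count))

# Keep only the rows that were drawn at least once; all other columns
# are carried along unchanged, and count becomes the sampled count.
sampled_df <- df[draws > 0, , drop = FALSE]
sampled_df$count <- draws[draws > 0]

sampled_df
```

This should follow the same distribution as tabulating the sample.int result, but it never materializes the 300 sampled rows. Note that both versions sample with replacement; drawing without replacement from the fully expanded pool of sequences is a slightly different scheme.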
