Reputation: 1349
I have a data frame in the following format: one column with raw sequences, another column with the number of times a sequence occurs, and other columns with other characteristics.
c1 <- c(324, 213, 122, 34)
c2 <- c("SDOIHHFOEKN", "SDIUFONBSD", "DSLIHFEIHDFS", "DOOIUDBD")
c3 <- c("G", "T", "U", "T")
df <- data.frame(count = c1, seq = c2, other = c3)
My actual data frame has over 10^6 rows and 20 columns.
I want to randomly sample N sequences from this while keeping the data frame structure shown above. For example, I want to randomly sample 300 sequences from the data frame above (its counts sum to 693). In expectation, the relative proportions of the four unique sequences should be preserved in the final data frame.
How can I do this random sampling? I was thinking of using reshape::untable to expand the data frame and then using a random number generator and grep to pick the rows, but then I cannot get the result back into the initial data frame format, with one row per unique sequence and a count of how many times that sequence shows up.
Thanks!
Upvotes: 1
Views: 1854
Reputation: 9705
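Use sample.int for speed. This draws 300 row indices with replacement, weighting each row by its count, and then collapses the draws back to one row per sequence (the pipeline needs dplyr):
library(dplyr)  # provides %>%, group_by(), summarize() and n()

sampled_df <- df[sample.int(nrow(df), 300, replace = TRUE, prob = df$count), ] %>%
  group_by(seq) %>%
  summarize(count = n(), other = unique(other)) %>%  # assumes one 'other' value per seq
  as.data.frame()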
> sampled_df
           seq count other
1     DOOIUDBD    21     T
2 DSLIHFEIHDFS    53     U
3   SDIUFONBSD   102     T
4  SDOIHHFOEKN   124     G
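Note that replace = TRUE means a sequence can, in principle, end up with a sampled count larger than its original count. If you would rather draw the 300 occurrences without replacement (closer to the untable idea in the question), a minimal base R sketch could look like the one below. The helper name sample_counts is made up for illustration, and the expansion step assumes the sum of the counts fits comfortably in memory, which may not hold for every data set.

```r
# Toy data from the question
df <- data.frame(
  count = c(324, 213, 122, 34),
  seq   = c("SDOIHHFOEKN", "SDIUFONBSD", "DSLIHFEIHDFS", "DOOIUDBD"),
  other = c("G", "T", "U", "T")
)

# Hypothetical helper (not from any package): draw n occurrences
# without replacement and return the data frame with updated counts.
sample_counts <- function(df, n) {
  stopifnot(n <= sum(df$count))
  # One row index per occurrence, e.g. row 1 repeated 324 times
  expanded <- rep(seq_len(nrow(df)), times = df$count)
  # Pick n positions without replacement, then look up their row indices
  drawn <- expanded[sample.int(length(expanded), n)]
  # Count how many times each original row was drawn
  out <- df
  out$count <- tabulate(drawn, nbins = nrow(df))
  # Drop sequences that were never drawn; all other columns are kept
  out[out$count > 0, , drop = FALSE]
}

set.seed(1)  # only to make the example reproducible
sampled_df <- sample_counts(df, 300)
```

Because the result is built from the original df, all 20 columns would carry over unchanged and only count is replaced, so no group_by step is needed.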
Upvotes: 4