Reputation: 23
I have a large data frame, SYN_data, with 150,000 rows and three columns named SNP, Gene and count. There is also a list r of 2545 count values, some of which are duplicates. I need to randomly sample 2545 rows without replacement from SYN_data such that their count values match those in r. I managed to do this part with the following code:
test1 <- SYN_data[ sample( which( SYN_data$count %in% r ) , 2545 ) , ]
The second condition is that the 2545 sampled rows should contain exactly 1671 unique Genes, which means that some Genes have more than one SNP. Is there a way to incorporate this condition into the same code? Any other code that meets all the conditions would also be very helpful. Thanks!
Sample data:
# vector of count values
r <- c(1, 7, 3, 14, 9)
# small excerpt of the data frame
SYN_data <- data.frame(
  SNP = c('1- 10068526', '1- 10129891', '1- 10200104',
          '1- 10200491', '1- 10470141', '1- 10671598'),
  Gene = c('AT1G28640', 'AT1G29030', 'AT1G29180',
           'AT1G29180', 'AT1G29900', 'AT1G30290'),
  count = c(14, 9, 3, 3, 7, 1)
)
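A note on the first condition: `%in%` only tests membership, so it ignores how often each value occurs in r. If the duplicates in r are meant to be respected, one option is to draw, for each count value, as many rows as that value occurs in r. The following is only a minimal sketch of that idea on made-up toy data (the toy values and the names `need`, `vals`, `ks`, `idx` and `matched` are assumptions for illustration, and the Gene condition is not handled here):

```r
# Toy data standing in for SYN_data and r (made-up values, not the real data)
set.seed(1)
SYN_data <- data.frame(
  SNP   = paste0("snp", 1:200),
  Gene  = paste0("gene", sample(50, 200, replace = TRUE)),
  count = sample(c(1, 3, 7, 9, 14), 200, replace = TRUE)
)
r <- c(1, 1, 3, 7, 7, 7, 9, 14)  # duplicates are intentional

# How many rows are needed for each distinct count value
need <- table(r)
vals <- as.numeric(names(need))
ks   <- as.vector(need)

# For each count value, draw the required number of row indices without replacement
idx <- unlist(Map(function(v, k) {
  rows <- which(SYN_data$count == v)
  stopifnot(length(rows) >= k)        # fail loudly if the data cannot supply k rows
  rows[sample.int(length(rows), k)]   # sample.int avoids the length-1 pitfall of sample()
}, vals, ks))

matched <- SYN_data[idx, ]
```

With this, `sort(matched$count)` has the same values as `sort(r)`, duplicates included.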
Upvotes: 1
Views: 1305
Reputation: 389335
Try the following:
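library(dplyr)
no_of_rows <- 2545
no_of_unique_gene <- 1671
temp <- SYN_data
# resample genes until the filtered data contains exactly the required number of distinct genes
while(n_distinct(temp$Gene) != no_of_unique_gene) {
gene <- sample(unique(SYN_data$Gene), no_of_unique_gene)
# keep rows whose count is in r and whose Gene was sampled
temp <- SYN_data[SYN_data$count %in% unique(r) & SYN_data$Gene %in% gene, ]
}
# one row per gene, so that every sampled gene is represented
part1 <- temp %>% group_by(Gene) %>% sample_n(floor(no_of_rows/no_of_unique_gene))
# fill the remaining rows from the rows not used yet
part2 <- temp %>% anti_join(part1) %>% sample_n(no_of_rows - nrow(part1))
final <- bind_rows(part1, part2)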
Then check length(unique(final$Gene)).
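To check every condition from the question in one go rather than only the gene count, a small helper along these lines could be used. This is just a sketch: `check_sample` is a hypothetical name, the toy data is made up, and it assumes that SNP uniquely identifies a row.

```r
# Hypothetical helper: TRUE only if `x` meets every condition from the question
check_sample <- function(x, r, n_rows, n_genes) {
  nrow(x) == n_rows &&                      # requested number of rows
    all(x$count %in% r) &&                  # every count occurs in r
    length(unique(x$Gene)) == n_genes &&    # requested number of distinct genes
    !anyDuplicated(x$SNP)                   # no row drawn twice (assumes SNP identifies a row)
}

# Tiny made-up example: 4 rows, 3 distinct genes
toy <- data.frame(
  SNP   = c("s1", "s2", "s3", "s4"),
  Gene  = c("g1", "g2", "g2", "g3"),
  count = c(14, 3, 3, 7)
)
check_sample(toy, r = c(1, 7, 3, 14, 9), n_rows = 4, n_genes = 3)  # TRUE
check_sample(toy, r = c(1, 7, 3, 14, 9), n_rows = 4, n_genes = 4)  # FALSE
```

For the real data the call would be something like check_sample(final, r, 2545, 1671).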
Upvotes: 1
Reputation: 25208
A possible approach is to sample 1671 unique genes first, then subset the dataset to the rows that have one of those genes and a count in the set of r. Here is an implementation of this approach in data.table:
library(data.table)
# had to create some dummy data, as it is not clear what the real data looks like
set.seed(0L)
nr <- 15e4
nSNP <- 1e3
nGene <- 1e4
ncount <- 1:14
r <- c(1,3,7,9,14)
SYN_data <- data.table(SNP=sample(nSNP, nr, TRUE),
Gene=sample(nGene, nr, TRUE), count=sample(ncount, nr, TRUE))
ncnt <- 2545
ng <- 1671
#sample 1671 genes
g <- SYN_data[, sample(unique(Gene), ng)]
#subset and sample the dataset
ix <- SYN_data[Gene %in% g & count %in% r, sample(.I, 1L), Gene]$V1
ans <- rbindlist(list(
SYN_data[ix],
SYN_data[-ix][Gene %in% g & count %in% r][, .SD[sample(.I, ncnt - ng)]]))
ans[, uniqueN(Gene)]
#1662 #not enough Gene in this dummy dataset
output:
SNP Gene count
1: 816 1261 14
2: 7 8635 1
3: 132 7457 1
4: 22 3625 3
5: 396 7640 7
---
2534: 423 6387 3
2535: 936 3908 7
2536: 346 9654 14
2537: 182 7492 3
2538: 645 635 1
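The shortfall above happens when a sampled gene has no row with a count in r. One way around it is to restrict the gene draw to genes that do have such a row. Below is a minimal base R sketch of that variant, not a drop-in replacement for the code above: `sample_rows_by_gene` and its arguments are hypothetical names, the toy data is made up, and it only enforces membership of count in r, not how often each value occurs.

```r
# Hypothetical helper; name and arguments are illustrative only
sample_rows_by_gene <- function(dat, r, n_rows, n_genes) {
  stopifnot(n_rows >= n_genes)
  ok    <- which(dat$count %in% r)            # rows whose count occurs in r
  genes <- unique(dat$Gene[ok])               # genes with at least one eligible row
  stopifnot(length(genes) >= n_genes)
  g     <- genes[sample.int(length(genes), n_genes)]
  pool  <- ok[dat$Gene[ok] %in% g]            # eligible rows of the picked genes
  stopifnot(length(pool) >= n_rows)
  # one random row per picked gene guarantees that every gene appears
  first <- vapply(split(pool, as.character(dat$Gene[pool])),
                  function(i) i[sample.int(length(i), 1L)], integer(1))
  # fill up with further rows of the same genes, without replacement
  rest  <- setdiff(pool, first)
  extra <- rest[sample.int(length(rest), n_rows - n_genes)]
  dat[c(unname(first), extra), ]
}

# Made-up toy data, scaled down from the sizes in the question
set.seed(42)
toy <- data.frame(SNP   = seq_len(5000),
                  Gene  = sample(300, 5000, replace = TRUE),
                  count = sample(1:14, 5000, replace = TRUE))
out <- sample_rows_by_gene(toy, r = c(1, 3, 7, 9, 14), n_rows = 250, n_genes = 170)
```

Because the extra rows come only from the genes already picked, the number of distinct genes in the result equals n_genes whenever the stopifnot checks pass. Indexing with sample.int also avoids the case where sample() is given a single number and samples from 1 up to that number instead.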
Upvotes: 0