amarah

Reputation: 23

Random sampling based on vector with multiple conditions R

I have a large data frame, SYN_data, with 150,000 rows and 3 columns named SNP, Gene and count. There is also a vector r of 2545 count values, some of which are duplicates. I need to randomly sample 2545 rows without replacement from SYN_data whose count values appear in r. I managed this part with the following code:

test1 <- SYN_data[ sample( which( SYN_data$count %in% r ) , 2545 ) , ]

The second condition is that the 2545 sampled rows must contain exactly 1671 unique Genes, which means that some Genes have more than one SNP. Is there a way to incorporate this condition into the same code? Any other code that meets all the conditions would also be very helpful. Thanks!

Sample data:

# vector of count values
r <- c(1, 7, 3, 14, 9)

SYN_data <- data.frame(
  SNP   = c('1- 10068526', '1- 10129891', '1- 10200104',
            '1- 10200491', '1- 10470141', '1- 10671598'),
  Gene  = c('AT1G28640', 'AT1G29030', 'AT1G29180',
            'AT1G29180', 'AT1G29900', 'AT1G30290'),
  count = c('14',  '9',  '3',  '3',  '7',  '1')
)

Upvotes: 1

Views: 1305

Answers (2)

Ronak Shah

Reputation: 389335

Try the following:

library(dplyr)

no_of_rows <- 2545
no_of_unique_gene <- 1671
temp <- SYN_data

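# Resample the genes until exactly no_of_unique_gene of them have a row with a count in r
while(n_distinct(temp$Gene) != no_of_unique_gene) {
  gene <- sample(unique(SYN_data$Gene), no_of_unique_gene)
  temp <- SYN_data[SYN_data$count %in% unique(r) & SYN_data$Gene %in% gene, ]
}
# Take one row per gene so that every selected gene appears at least once
part1 <- temp %>% group_by(Gene) %>% sample_n(floor(no_of_rows/no_of_unique_gene))
# Fill up to no_of_rows with rows that were not already chosen
part2 <- temp %>% anti_join(part1) %>% sample_n(no_of_rows - nrow(part1))
final <- bind_rows(part1, part2)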

Then check length(unique(final$Gene)); it should equal no_of_unique_gene.
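The check can be extended to all of the conditions in the question. The following is a minimal base R sketch, not part of the answer above: check_sample is a hypothetical helper name, and the toy data frame exists only for illustration.

```r
# Hypothetical helper: tests a sampled data frame against the conditions
# from the question (row count, number of unique genes, counts drawn from r).
check_sample <- function(sampled, r, n_rows, n_genes) {
  c(
    rows_ok   = nrow(sampled) == n_rows,
    genes_ok  = length(unique(sampled$Gene)) == n_genes,
    counts_ok = all(sampled$count %in% r)
  )
}

# Toy data for illustration only
toy <- data.frame(
  SNP   = c("s1", "s2", "s3", "s4"),
  Gene  = c("g1", "g2", "g2", "g3"),
  count = c(14, 9, 3, 7)
)
check_result <- check_sample(toy, r = c(1, 7, 3, 14, 9), n_rows = 4, n_genes = 3)
print(check_result)
```

For the real data the call would presumably be check_sample(final, r, no_of_rows, no_of_unique_gene).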

Upvotes: 1

chinsoon12

Reputation: 25208

A possible approach is to sample 1671 unique genes first, then subset the dataset to the rows that belong to those genes and have a count in r. Here is an implementation of this approach in data.table:

library(data.table)

#dummy data, since it is not clear what the real data looks like
set.seed(0L)
nr <- 15e4
nSNP <- 1e3 
nGene <- 1e4
ncount <- 1:14     
r <- c(1,3,7,9,14)
SYN_data <- data.table(SNP=sample(nSNP, nr, TRUE),
    Gene=sample(nGene, nr, TRUE), count=sample(ncount, nr, TRUE))

ncnt <- 2545
ng <- 1671

#sample 1671 genes
g <- SYN_data[, sample(unique(Gene), ng)]    

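#subset and sample the dataset: one row per gene first, then fill up to ncnt rows
#note: for a gene with a single matching row, sample(.I, 1L) draws from 1:.I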
ix <- SYN_data[Gene %in% g & count %in% r, sample(.I, 1L), Gene]$V1
ans <- rbindlist(list(
    SYN_data[ix],
    SYN_data[-ix][Gene %in% g & count %in% r][, .SD[sample(.I, ncnt - ng)]]))
ans[, uniqueN(Gene)]
#1662 #not enough Gene in this dummy dataset

output:

      SNP Gene count
   1: 816 1261    14
   2:   7 8635     1
   3: 132 7457     1
   4:  22 3625     3
   5: 396 7640     7
  ---               
2534: 423 6387     3
2535: 936 3908     7
2536: 346 9654    14
2537: 182 7492     3
2538: 645  635     1
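As a complement, here is a minimal base R sketch of the same two-stage idea (one row per sampled gene, then a fill-up from the same genes). It is an illustration under stated assumptions, not the code used above: sample_rows_by_gene is a hypothetical helper, the toy data is invented, and two choices differ from the data.table version. Genes are drawn only from those that have at least one row with a count in r, so the requested number of genes can be reached, and rows are picked by position with sample.int, which stays correct when a gene has only one eligible row.

```r
# Hypothetical helper: sample n_rows rows that cover exactly n_genes genes,
# using only rows whose count is in r.
sample_rows_by_gene <- function(data, r, n_rows, n_genes) {
  stopifnot(n_rows >= n_genes)
  eligible <- which(data$count %in% r)
  genes <- unique(as.character(data$Gene[eligible]))
  if (length(genes) < n_genes) stop("fewer eligible genes than requested")

  # Stage 1: choose the genes, then one eligible row for each of them.
  chosen <- genes[sample.int(length(genes), n_genes)]
  pool <- eligible[as.character(data$Gene[eligible]) %in% chosen]
  by_gene <- split(pool, as.character(data$Gene[pool]))
  # Indexing by position avoids sample(x, 1) drawing from 1:x
  # when x is a single row number.
  first <- vapply(by_gene, function(x) x[sample.int(length(x), 1L)], integer(1))

  # Stage 2: fill the remaining rows from the same genes, without replacement.
  rest <- setdiff(pool, first)
  n_extra <- n_rows - n_genes
  if (length(rest) < n_extra) stop("not enough eligible rows for the chosen genes")
  extra <- rest[sample.int(length(rest), n_extra)]

  data[c(unname(first), extra), ]
}

# Toy data for illustration only: 20 genes with 5 rows each,
# 3 of which have a count in r.
toy <- data.frame(
  SNP   = paste0("snp", 1:100),
  Gene  = rep(paste0("g", 1:20), each = 5),
  count = rep(c(1, 2, 3, 7, 5), times = 20)
)
set.seed(1)
res <- sample_rows_by_gene(toy, r = c(1, 3, 7, 9, 14), n_rows = 18, n_genes = 10)
```

If the chosen genes do not hold enough eligible rows, this sketch stops with an error instead of retrying; wrapping the call in a retry loop would be one way to handle that.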

Upvotes: 0
