Esben Eickhardt
Esben Eickhardt

Reputation: 3872

R: Dividing vector into intervals, and test which integer falls into what interval

I have 23 chromosomes and their lengths

chromosome    length
1             249250621
2             243199373
3             198022430
4             191154276
5             180915260
6             171115067 
..            .........
Y             59373566

For each chromosome, I want to create 5000 bins/intervals of equal size.

Chr1:
bin_number    start        end
1             1            49850
2             49851        99700
....          .....        .....
5000          249200771    249250621

I have tried using both "cut" and "cut2" for this purpose. "cut2" cannot handle the lenght of the chromosomes and crashes, while cut provides one with intervals for each individual place (249250621 intervals!).

cut2(1:249250621, g=5000, onlycuts = TRUE)

cut(1:249250621, breaks=5000)

When I have the the intervals, I want to assign which bin/interval 50.000 variants each fall within.

My data (Chromosome 1):

variant            chromosome    position
1:20000_G/A        1             20000
1:30000_C/CCCCT    1             30000
1:60000_G/T        1             60000
..............     ..            .......

What I want:

variant            chromosome    position    bin_number
1:20000_G/A        1             20000       1
1:30000_C/CCCCT    1             30000       1
1:60000_G/T        1             60000       2
..............     ..            .......     ...

I would appreciate any suggestions for methods that are relevant for splitting my chromosomes into intervals. When I have the intervals, I need methods that quickly can test which interval the variant belongs to.

Upvotes: 0

Views: 2580

Answers (3)

Ibon Tamayo
Ibon Tamayo

Reputation: 91

If I understood well your algorithm, you are splitting each chromosome into 10000 bins. So you will get different sizes for each range. I used to apply this algorithm to create same-size ranges independent on the chromosome.

chrSizes <- data.frame(chromosome = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y"), 
                       length = c("249250621","243199373", "198022430", "191154276", "180915260", "171115067", "159138663", "146364022", "141213431", "135534747", "135006516", "133851895", "115169878", "107349540", "102531392", "90354753", "81195210", "78077248", "59128983", "63025520", "48129895", "51304566", "155270560", "59373566"), 
                       stringsAsFactors = FALSE)

sizerange <- 5000000
lastranges <- NA
h <- 0

for (i in 1:24) 
{
  thelast <- 1
  bynum <- format(sizerange, scientific = FALSE)
  chrlist <- c(paste0(chrSizes$chromosome[i],":1-",bynum))
  biggest <- chrSizes$length[i]
  while(thelast < biggest)
  {
    bynum1 <- format(as.numeric(bynum)+1, scientific = FALSE)
    bynum2 <- format(as.numeric(bynum1)+sizerange-1, scientific = FALSE)
    berria <- paste0(paste0(chrSizes$chromosome[i],":",bynum1,"-",as.character(bynum2)))
    chrlist <- c(chrlist,berria)
    thelast <- as.numeric(bynum2)+sizerange
    bynum <- format(as.numeric(bynum)+sizerange, scientific = FALSE)
  }
  azkenreg <- paste0(paste0(chrSizes$chromosome[i],":",bynum,"-",as.character(biggest)))
  chrlist <- c(chrlist,azkenreg)
  lastranges <- c(lastranges,chrlist)
}

lastranges <- lastranges[-1]

df <- data.frame(lastranges)
write.table(df,file = "fastacontigs_splited_bysize2.txt",quote = FALSE, row.names = FALSE, col.names = FALSE)

In this case the results was:

1:1-5000000
1:5000001-10000000
1:10000001-15000000
1:15000000-249250621
2:1-5000000
2:5000001-10000000
2:10000001-15000000
2:15000000-243199373
3:1-5000000
3:5000001-10000000
3:10000001-15000000
3:15000000-198022430
4:1-5000000
4:5000001-10000000
4:10000001-15000000
4:15000000-191154276
5:1-5000000
5:5000001-10000000
5:10000001-15000000
5:15000000-180915260
6:1-5000000
6:5000001-10000000
6:10000001-15000000
6:15000000-171115067
7:1-5000000
7:5000001-10000000
7:10000001-15000000
7:15000000-159138663
8:1-5000000
8:5000001-10000000
8:10000001-15000000
8:15000000-146364022
9:1-5000000
9:5000001-10000000
9:10000001-15000000
9:15000000-141213431
10:1-5000000
10:5000001-10000000
10:10000001-15000000
10:15000000-135534747
11:1-5000000
11:5000001-10000000
11:10000001-15000000
11:15000000-135006516
12:1-5000000
12:5000001-10000000
12:10000001-15000000
12:15000000-133851895
13:1-5000000
13:5000001-10000000
13:10000001-15000000
13:15000000-115169878
14:1-5000000
14:5000001-10000000
14:10000001-15000000
14:15000000-107349540
15:1-5000000
15:5000001-10000000
15:10000001-15000000
15:15000000-102531392
16:1-5000000
16:5000001-10000000
16:10000001-15000000
16:15000001-20000000
16:20000001-25000000
16:25000001-30000000
16:30000001-35000000
16:35000001-40000000
16:40000001-45000000
16:45000001-50000000
16:50000001-55000000
16:55000001-60000000
16:60000001-65000000
16:65000001-70000000
16:70000001-75000000
16:75000001-80000000
16:80000001-85000000
16:85000000-90354753
17:1-5000000
17:5000001-10000000
17:10000001-15000000
17:15000001-20000000
17:20000001-25000000
17:25000001-30000000
17:30000001-35000000
17:35000001-40000000
17:40000001-45000000
17:45000001-50000000
17:50000001-55000000
17:55000001-60000000
17:60000001-65000000
17:65000001-70000000
17:70000001-75000000
17:75000000-81195210
18:1-5000000
18:5000001-10000000
18:10000001-15000000
18:15000001-20000000
18:20000001-25000000
18:25000001-30000000
18:30000001-35000000
18:35000001-40000000
18:40000001-45000000
18:45000001-50000000
18:50000001-55000000
18:55000001-60000000
18:60000001-65000000
18:65000000-78077248
19:1-5000000
19:5000001-10000000
19:10000001-15000000
19:15000001-20000000
19:20000001-25000000
19:25000001-30000000
19:30000001-35000000
19:35000001-40000000
19:40000001-45000000
19:45000000-59128983
20:1-5000000
20:5000001-10000000
20:10000001-15000000
20:15000001-20000000
20:20000001-25000000
20:25000001-30000000
20:30000001-35000000
20:35000001-40000000
20:40000001-45000000
20:45000001-50000000
20:50000001-55000000
20:55000000-63025520
21:1-5000000
21:5000001-10000000
21:10000001-15000000
21:15000001-20000000
21:20000001-25000000
21:25000001-30000000
21:30000001-35000000
21:35000000-48129895
22:1-5000000
22:5000001-10000000
22:10000001-15000000
22:15000001-20000000
22:20000001-25000000
22:25000001-30000000
22:30000001-35000000
22:35000001-40000000
22:40000001-45000000
22:45000000-51304566
X:1-5000000
X:5000001-10000000
X:10000001-15000000
X:15000000-155270560
Y:1-5000000
Y:5000001-10000000
Y:10000001-15000000
Y:15000001-20000000
Y:20000001-25000000
Y:25000001-30000000
Y:30000001-35000000
Y:35000001-40000000
Y:40000001-45000000
Y:45000000-59373566

Upvotes: 1

Esben Eickhardt
Esben Eickhardt

Reputation: 3872

Thanks for your input. I have chosen to create the intervals using a simple loop, to ensure that the intervals were of the desired size.

I created a data.frame with sizes of chromosomes

chrSizes <- data.frame(chromosome = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y"), length = c("249250621","243199373", "198022430", "191154276", "180915260", "171115067", "159138663", "146364022", "141213431", "135534747", "135006516", "133851895", "115169878", "107349540", "102531392", "90354753", "81195210", "78077248", "59128983", "63025520", "48129895", "51304566", "155270560", "59373566"), stringsAsFactors = FALSE)

Then I looped through each chromosome creating the intervals, by finding the precise chunk size and then rounding down. The remainder then was used to add one to the first many intervals.

numOfBins <- 10000
chrBinList <- list()
for (i in 1:24) {
  chrBins <- c()
  chrLength <- as.numeric(chrSizes[i,2])
  chunkSize <- floor(chrLength/numOfBins)
  remainder <- chrLength %% chunkSize
  counter <- 1

  # Adding remainder to the first intervals
  for (j in 1:(remainder-1)) {
    chrBins <- c(chrBins, counter)
    counter <- counter + chunkSize + 1
    chrBins <- c(chrBins, counter)
  }

  # Adding normal sized chunks to remaining intervals
  for (k in remainder:numOfBins) {
    chrBins <- c(chrBins, counter)
    counter <- counter + chunkSize
    chrBins <- c(chrBins, counter)
  }

  # Creating a data.frame with intervals
  interval.df <- as.data.frame(matrix(chrBins,ncol = 2, byrow = TRUE))
  colnames(interval.df) <- c("start", "end")

  # Adding to list
  chrBinList[[chrSizes[i,1]]] <- interval.df
}

As for testing whether values fall within the different bins, I have come up with a slow solution using apply. I am, however, currently look into the parallel apply functions.

Upvotes: 0

Kota Mori
Kota Mori

Reputation: 6760

If the bin range is constant, this works:

mydata <- data.frame(position = c(20000, 30000, 60000, 
                              49850, 49851, 99700, 99701))
mydata$bin <- ceiling(mydata$position / 49850)

More generally, if the bin range is not constant but you have defined the cut points already, you can use cut by giving that as breaks.

cutpoints <- c(0, 49850, 99700, 149550)
mydata$bin2 <- cut(mydata$position, breaks = cutpoints)

You can edit your labels with a little tweak.

mydata$bin3 <- cut(mydata$position, breaks = cutpoints,
               labels = seq(length(cutpoints)-1))

Upvotes: 0

Related Questions