DaniCee
DaniCee

Reputation: 3217

Generate a regular numeric sequence by increments of 3 BUT considering gaps in the input

Say I have a sequence like myseq below. It is a DNA sequence, so each group of 3 letters make for one letter (aminoacid) in the accompanying myaa sequence. It has a length of 13aa.

I want to create a mydf dataframe with the start and end positions of each group of 3 letters in myseq. In the basic example below without gaps, I do this easily with seq().

myseq <- "CTACGTAGCTAGCTGGGGTACCGTTATTCAGCTAGCATG"
myaa <- "XYZWWXZZVVYZX"
st_pos <- seq(1, nchar(myseq)-2, 3)
en_pos <- seq(3, nchar(myseq), 3)
mydf <- data.frame(starts=st_pos, ends=en_pos, label=unlist(strsplit(myaa, "")))

This produces this dataframe, which is exactly what I want:

> mydf
   starts ends label
1       1    3     X
2       4    6     Y
3       7    9     Z
4      10   12     W
5      13   15     W
6      16   18     X
7      19   21     Z
8      22   24     Z
9      25   27     V
10     28   30     V
11     31   33     Y
12     34   36     Z
13     37   39     X

However, I am encountering examples with my real data where myseq contains gaps. In these cases I cannot rely on seq() because I need to account for the gaps for the start and end positions.

How should I go about it? I show you 2 cases below, and my expected mydf dataframe, for which I just hard coded the start and end positions.

#case 1 - gaps breaking groups of 3 letters in half
myseq <- "CTACGTAGCTAGCTGGGGTACCGTT---ATTC--AGCTAGCATG"
st_pos <- c(1,4,7,10,13,16,19,22,25,31,36,39,42)
en_pos <- c(3,6,9,12,15,18,21,24,30,35,38,41,44)
mydf <- data.frame(starts=st_pos, ends=en_pos, label=unlist(strsplit(myaa, "")))

#case 2 - gaps in between groups of 3 letters, and at the beginning of groups of 3 letters
myseq <- "CTACGTAGCTAGCTGGGGTACCGT---TAT--TCAGCTAGCATG"
st_pos <- c(1,4,7,10,13,16,19,22,28,33,36,39,42)
en_pos <- c(3,6,9,12,15,18,21,24,30,35,38,41,44)
mydf <- data.frame(starts=st_pos, ends=en_pos, label=unlist(strsplit(myaa, "")))

How should I produce the st_pos and en_pos vectors in these cases in an easy way?

EDIT

To get these numbers, I just split the sequences in groups of 3 letters and count the positions manually, but I don't know how to accomplish that in an automatic way.

For case 1 sequence start positions:

1    4    7    10   13   16   19   22   25      31     36   39   42
CTA  CGT  AGC  TAG  CTG  GGG  TAC  CGT  T---AT  TC--A  GCT  AGC  ATG

Likewise for case 1 end positions:

  3    6    9    12   15   18   21   24      30     35   38   41   44
CTA  CGT  AGC  TAG  CTG  GGG  TAC  CGT  T---AT  TC--A  GCT  AGC  ATG

Now for case 2 sequence start positions:

1    4    7    10   13   16   19   22        28       33   36   39   42
CTA  CGT  AGC  TAG  CTG  GGG  TAC  CGT  ---  TAT  --  TCA  GCT  AGC  ATG

And case 2 end positions:

  3    6    9    12   15   18   21   24        30       35   38   41   44
CTA  CGT  AGC  TAG  CTG  GGG  TAC  CGT  ---  TAT  --  TCA  GCT  AGC  ATG

Upvotes: 1

Views: 106

Answers (2)

jay.sf
jay.sf

Reputation: 73572

You could try something like this:

> f <- \(string, n=3) {
+   chr <- el(strsplit(string, ''))
+   mat <- matrix(split(chr, cumsum(chr != '-')), n)
+   strings3 <- apply(mat, 2, \(x) paste(unlist(x), collapse=''))
+   lens <- cumsum(nchar(strings3))
+   cbind(c(1, lens + 1L)[-(length(lens) + 1L)], lens + 1L)
+ }
> 
> f("CTACGTAGCTAGCTGGGGTACCGTTATTCAGCTAGCATG")
      [,1] [,2]
 [1,]    1    4
 [2,]    4    7
 [3,]    7   10
 [4,]   10   13
 [5,]   13   16
 [6,]   16   19
 [7,]   19   22
 [8,]   22   25
 [9,]   25   28
[10,]   28   31
[11,]   31   34
[12,]   34   37
[13,]   37   40
> f("CTACGTAGCTAGCTGGGGTACCGTT---ATTC--AGCTAGCATG")
      [,1] [,2]
 [1,]    1    4
 [2,]    4    7
 [3,]    7   10
 [4,]   10   13
 [5,]   13   16
 [6,]   16   19
 [7,]   19   22
 [8,]   22   25
 [9,]   25   31
[10,]   31   36
[11,]   36   39
[12,]   39   42
[13,]   42   45
> f("CTACGTAGCTAGCTGGGGTACCGT---TAT--TCAGCTAGCATG")
      [,1] [,2]
 [1,]    1    4
 [2,]    4    7
 [3,]    7   10
 [4,]   10   13
 [5,]   13   16
 [6,]   16   19
 [7,]   19   22
 [8,]   22   28
 [9,]   28   33
[10,]   33   36
[11,]   36   39
[12,]   39   42
[13,]   42   45

Upvotes: 1

Edward
Edward

Reputation: 19339

Try the following, which identifies the index of any upper-case letter in the sequence using grep and then creates the start and end positions from this.

Case #1

myseq <- "CTACGTAGCTAGCTGGGGTACCGTT---ATTC--AGCTAGCATG"

idx <- grep("[A-Z]", unlist(strsplit(myseq, "")))
data.frame(st_pos=idx[seq(1, length(idx), 3)], 
           en_pos=idx[seq(3, length(idx), 3)])

Gives:

   st_pos en_pos
1       1      3
2       4      6
3       7      9
4      10     12
5      13     15
6      16     18
7      19     21
8      22     24
9      25     30
10     31     35
11     36     38
12     39     41
13     42     44

which matches your expected output.


Case #2

myseq <- "CTACGTAGCTAGCTGGGGTACCGT---TAT--TCAGCTAGCATG"

idx <- grep("[A-Z]", unlist(strsplit(myseq, "")))
data.frame(st_pos=idx[seq(1, length(idx), 3)], 
           en_pos=idx[seq(3, length(idx), 3)])

Gives

   st_pos en_pos
1       1      3
2       4      6
3       7      9
4      10     12
5      13     15
6      16     18
7      19     21
8      22     24
9      28     30
10     33     35
11     36     38
12     39     41
13     42     44

Which also matches your expected output.

Upvotes: 1

Related Questions