Almighty Shintru

Reputation: 31

Shuffling string (non-randomly) for maximal difference

After trying for an embarrassingly long time and searching extensively online, I come to you with a problem.

I am looking for a method to (non-randomly) shuffle a string to get a string which has the maximal ‘distance’ from the original one, while still containing the same set of characters.

My particular case is for short nucleotide sequences (4-8 nt long), as represented by these example sequences:

seq_1<-"ACTG"
seq_2<-"ATGTT"
seq_3<-"ACGTGCT"

For each sequence, I would like to get a scramble sequence which has the same nucleobase counts, but in a different order.

A favourable scramble sequence for seq_3 could be something like:

seq_3.scramble<-"CATGTGC"

where none of the sequence positions 1-7 has the same nucleobase as the original, but the overall nucleobase count is the same (A = 1, C = 2, G = 2, T = 2). Naturally, it will not always be possible to get a completely different string; I would just flag those cases in the output.
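To make the two criteria above concrete, here is a minimal sketch (in Python, the language used in one of the answers below) that checks a candidate scramble against the original. The helper name `check_scramble` is made up for illustration, and it assumes both strings have the same length.

```python
from collections import Counter

def check_scramble(original, scrambled):
    """Return (same_counts, fixed_positions) for two equal-length strings.

    same_counts     -- True if both strings contain the same characters
                       the same number of times
    fixed_positions -- 1-based positions where the two strings still agree
    """
    same_counts = Counter(original) == Counter(scrambled)
    fixed_positions = [i + 1
                       for i, (a, b) in enumerate(zip(original, scrambled))
                       if a == b]
    return same_counts, fixed_positions

print(check_scramble("ACGTGCT", "CATGTGC"))  # (True, [])
```

An empty list of fixed positions means the scramble differs from the original at every position; a non-empty list is what would be flagged.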

I am not particularly interested in randomising the sequence and would prefer a method which makes these scramble sequences in a consistent manner.

Do you have any ideas?

Upvotes: 3

Views: 62

Answers (2)

CPak

Reputation: 13581

Give this a try. Rather than return a single string that fits your criteria, I return a data frame of all distinct permutations, sorted by their string-distance score. The score is calculated using stringdist(..., ..., method="hamming"), which counts the positions at which two equal-length strings differ, i.e. the number of substitutions required to convert string A to B.

seq_3<-"ACGTGCT"

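myfun <- function(S) {
    require(combinat)
    require(dplyr)
    require(stringdist)
    # split the string into single characters
    vec <- unlist(strsplit(S, ""))
    # build every permutation and paste each one back into a string
    P <- sapply(permn(vec), function(i) paste(i, collapse=""))
    # Hamming distance between the original and each permutation
    Dist <- c(stringdist(S, P, method="hamming"))
    # drop duplicate strings, then sort by descending distance
    df <- data.frame(seq = P, HD = Dist, stringsAsFactors = FALSE) %>%
            distinct(seq, HD) %>%
            arrange(desc(HD))
    return(df)
}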

library(combinat)
library(dplyr)
library(stringdist)
head(myfun(seq_3), 10)

       # seq HD
# 1  TACGTGC  7
# 2  TACGCTG  7
# 3  CACGTTG  7
# 4  GACGTTC  7
# 5  CGACTTG  7
# 6  CGTACTG  7
# 7  TGCACTG  7
# 8  GTCACTG  7
# 9  GACCTTG  7
# 10 GATCCTG  7

Upvotes: 1

Mohammad Athar

Reputation: 1980

This is in Python, since I don't know R, but the basic approach is as follows:

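import itertools

def calcDistance(originalString, newString):
    # Hamming distance: number of positions at which the two strings differ
    d = 0
    for a, b in zip(originalString, newString):
        if a != b:
            d += 1
    return d


s = "ACTG"
d_max = 0
s_final = ""
# brute force: try every permutation and keep the first one with the largest distance
for combo in itertools.permutations(s):
    d = calcDistance(s, combo)
    if d > d_max:
        d_max = d
        s_final = "".join(combo)  # permutations() yields tuples, so join back into a string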
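
The loop above visits all n! permutations, which is fine for 4-8 nt but is not needed to reach the maximum. As a sketch of one deterministic alternative (not taken from either answer): order the positions by their character, then give each position the character found `shift` places further along that ordering, where `shift` is the count of the most frequent character. The names `scramble` and `hamming` are made up for illustration. Under this scheme the only positions that can stay unchanged are the max(0, 2 * shift - n) that are unavoidable when one base fills more than half of the sequence.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def scramble(seq):
    """Deterministic rearrangement of seq using a rotation of sorted positions."""
    n = len(seq)
    if n == 0:
        return seq
    # positions ordered by the character they hold (sorted() is stable,
    # so ties keep their original left-to-right order)
    order = sorted(range(n), key=lambda i: seq[i])
    # rotate by the size of the largest group of identical characters
    shift = max(Counter(seq).values())
    out = [""] * n
    for k, pos in enumerate(order):
        out[pos] = seq[order[(k + shift) % n]]
    return "".join(out)

for s in ["ACTG", "ATGTT", "ACGTGCT"]:
    t = scramble(s)
    unchanged = len(s) - hamming(s, t)
    # flag sequences where a completely different string is not possible
    print(s, t, hamming(s, t), "FLAG" if unchanged else "")
# ACTG CGAT 4
# ATGTT TATGT 4 FLAG
# ACGTGCT CGTATGC 7
```

The same input always gives the same output, so the result is consistent between runs; "ATGTT" is flagged because three of its five bases are T, so at least one T has to stay in place.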

Upvotes: 1
