sunta3iouxos

Reputation: 61

Calculate differences between sequences in vector, for distance matrix in R

Hi all, I am trying to create a distance matrix from randomly generated sequences. First, set up the alphabet:

    DNA <- c("A","G","T","C")
    randomDNA <- c()

Create the vector of 64 elements:

    for (i in 1:64){
      randomDNA[i] <- paste0(sample(DNA, 6, replace = T), sep = "", collapse = "")
      warnings()
    }
    sizeofDNA <- length(randomDNA)

This is the part where I want to iterate over the vector's components:

    split_vector <- c()
    DNAdiff <- c()
    for (i in 1:length(randomDNA)){
      split_vector <- strsplit(randomDNA[i], "")[[1]]
      #print(split_vector)
      for (j in 1:length(randomDNA)){
      split_vector2 <- strsplit(randomDNA[j], "")[[1]]
      #print(split_vector2)
      DNAdiff[i,j] <- setdiff(split_vector,split_vector2)
      #or
      #DNAdiff[i] <- length(setdiff(strsplit(randomDNA[22], "")[[1]],strsplit(randomDNA[33], "")[[1]]))
      }
    }

What does not work: (A) setdiff does not behave as I expect, and (B) no array is created.

Question: how do I store the results of setdiff (once it works) in an array, so that I end up with a distance-matrix-like array? Any recommendation is highly welcome. Thank you all.

EDIT: So there are 2 solutions:

A. Use the "adist" function, as mentioned in the comments by @ThomasIsCoding; this calculates the Levenshtein distances:

    DNA <- c("A","G","T","C")
    randomDNA <- c()

    for (i in 1:64){
      randomDNA[i] <- paste0(sample(DNA, 6, replace = TRUE), collapse = "")
    }

    # adist() already returns a matrix of Levenshtein distances
    dm <- adist(randomDNA)

    rownames(dm) <- randomDNA
    colnames(dm) <- randomDNA

    pdf("heatmap.pdf")
    heatmap(dm, Rowv = NA, Colv = NA)
    dev.off()
    # write.csv() always writes column names; passing col.names only triggers a warning
    write.csv(dm, "distance_matrix.csv", row.names = TRUE)
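To see what adist reports on a small case, here is a minimal sketch. The three toy sequences are made up for illustration and are not part of my data. It also shows that the Levenshtein distance can be much smaller than the number of mismatching positions when one sequence is a shifted copy of another.

```r
# Toy sequences chosen for illustration (an assumption, not from the data above)
seqs <- c("ACGTAC", "CGTACA", "ACGTAA")

# adist() comes from the utils package, which is attached by default
lev <- adist(seqs)
dimnames(lev) <- list(seqs, seqs)
print(lev)

# "CGTACA" is "ACGTAC" shifted by one base: one deletion plus one insertion
shift_lev <- lev["ACGTAC", "CGTACA"]

# Position-by-position comparison of the same pair
shift_mismatch <- sum(strsplit("ACGTAC", "")[[1]] != strsplit("CGTACA", "")[[1]])

cat("Levenshtein:", shift_lev, " mismatching positions:", shift_mismatch, "\n")
```

So which distance to use depends on whether insertions and deletions should count as cheap edits for these sequences.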

B. Alternatively, calculate the Hamming distance (the number of positions at which two equal-length sequences differ) with an explicit loop:

    DNA <- c("A","G","T","C")
    randomDNA <- c()

    for (i in 1:96){
      randomDNA[i] <- paste0(sample(DNA, 6, replace = TRUE), collapse = "")
    }

    # Pre-allocate the matrix so that Humm[i,j] can be assigned
    Humm <- matrix(nrow=length(randomDNA), ncol=length(randomDNA))
    for (i in 1:length(randomDNA)){
      split_vector <- strsplit(randomDNA[i], "")[[1]]
      for (j in 1:length(randomDNA)){
        split_vector2 <- strsplit(randomDNA[j], "")[[1]]
        # Hamming distance: count the positions where the two sequences differ
        Humm[i,j] <- sum(split_vector != split_vector2)
      }
    }

    rownames(Humm) <- randomDNA
    colnames(Humm) <- randomDNA
    pdf("heatmap.pdf")
    heatmap(Humm, Rowv = NA, Colv = NA)
    dev.off()
    # write.csv() always writes column names; passing col.names only triggers a warning
    write.csv(Humm, "distance_matrix.csv", row.names = TRUE)
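The double loop splits every sequence again for each pair. A possible variant, sketched here under the assumption that all sequences have the same length, splits each sequence once and compares one sequence against all others in a single step. The helper name `hamming_matrix` is hypothetical and only used for this illustration.

```r
# hamming_matrix() is a hypothetical helper name, not an existing R function.
# Assumption: every sequence has the same number of characters.
hamming_matrix <- function(seqs) {
  stopifnot(length(unique(nchar(seqs))) == 1)

  # One column per sequence, one row per position
  chars <- t(do.call(rbind, strsplit(seqs, "")))
  n <- ncol(chars)

  d <- matrix(0, nrow = n, ncol = n, dimnames = list(seqs, seqs))
  for (i in seq_len(n)) {
    # chars[, i] is recycled down every column, so each column (sequence)
    # is compared position by position with sequence i
    d[i, ] <- colSums(chars != chars[, i])
  }
  d
}

# Toy input chosen for illustration
h <- hamming_matrix(c("AAAAAA", "AAAAAT", "TTTTTT"))
print(h)
```

The result has the same layout as the matrix built by the loop, so it can be passed to heatmap or write.csv in the same way.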

Upvotes: 1

Views: 210

Answers (1)

ThomasIsCoding

Reputation: 101099

I think you might need adist to get the distance matrix, e.g.,

    adist(randomDNA)
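As for the two symptoms in the question, the following is a minimal sketch of what is probably going on; the two example sequences and the size `n` are made up for illustration. setdiff treats its arguments as sets, so it ignores positions and repeated letters, and an object created with c() has no dimensions, so it cannot be indexed as `[i, j]`.

```r
# Two toy sequences that differ only in the last base (illustrative values)
a <- strsplit("AAGTCA", "")[[1]]
b <- strsplit("AAGTCC", "")[[1]]

# Set difference: every letter used in a also occurs somewhere in b,
# so the result is empty even though the sequences are not identical
set_result <- setdiff(a, b)
print(set_result)

# Position-wise comparison counts the actual mismatches
n_mismatch <- sum(a != b)
print(n_mismatch)

# Allocate a matrix before the loop, then store one number per cell
n <- 3  # illustrative size; in the question this would be length(randomDNA)
DNAdiff <- matrix(NA_integer_, nrow = n, ncol = n)
DNAdiff[1, 2] <- n_mismatch
print(DNAdiff)
```

Note that setdiff can also return several letters at once, which would not fit into a single matrix cell either.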

Upvotes: 1
