nouse

Reputation: 3461

Compare the similarity of character vectors by position

I have the following dataset:

df <- data.frame(barcode = c("B1", "B2", "B3", "B4"),
                 sequence = sapply(1:4, function(x) paste(sample(c("A", "C", "T", "G"), 4, replace = TRUE), collapse = "")))

I want to know how similar each barcode's sequence is to every other sequence in df$sequence, compared position by position.

A complete agreement would be 100%, one position in disagreement would be 75%, and so on.

Example: df$sequence contains (AATT, AATT, TATT, TATA)

The pairwise similarity matrix would then be:

      B1   B2   B3   B4
B1     x  100   75   50
B2   100    x   75   50
B3    75   75    x   75
B4    50   50   75    x

This holds even though every barcode contains two T's and two A's. So the question is: how many positions have the same content between two barcodes? How can I achieve this in R?

Upvotes: 1

Views: 154

Answers (1)

user2974951

Reputation: 10375

Using the Levenshtein (edit) distance, or rather 1 minus the distance divided by the sequence length:

> 1-adist(df$sequence)/4

     [,1] [,2] [,3] [,4]
[1,] 1.00 0.75 0.25 0.25
[2,] 0.75 1.00 0.00 0.25
[3,] 0.25 0.00 1.00 0.50
[4,] 0.25 0.25 0.50 1.00

(assuming all sequences have length 4).

Edit: I misunderstood your problem. Levenshtein distance allows insertions and deletions, so it finds the best alignment between two strings and can shift characters if necessary. You want an exact position-by-position comparison, in that case...

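# For every pair of sequences, count the positions holding the same character,
# then divide by the sequence length (4) to get the fraction of matches
sapply(df$sequence, function(x) {
  sapply(df$sequence, function(y) {
    sum(strsplit(x, "")[[1]] == strsplit(y, "")[[1]])
  })
}) / 4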

     ACAC AGAC CCTT CGCT
ACAC 1.00 0.75 0.25 0.00
AGAC 0.75 1.00 0.00 0.25
CCTT 0.25 0.00 1.00 0.50
CGCT 0.00 0.25 0.50 1.00
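The same position-wise count can also be computed on a character matrix instead of calling strsplit inside the nested loop. This is only a sketch: `positional_similarity` and its `labels` argument are hypothetical names introduced here, and it assumes all sequences have the same length. It uses the example sequences from the question and returns percentages.

```r
# Hypothetical helper (not from the original answer): percentage of positions
# at which two equal-length sequences hold the same character.
positional_similarity <- function(seqs, labels = seqs) {
  seqs <- as.character(seqs)                    # guard against factor columns
  stopifnot(length(unique(nchar(seqs))) == 1)   # assumption: equal lengths
  # One row per sequence, one column per position
  chars <- do.call(rbind, strsplit(seqs, ""))
  n <- nrow(chars)
  sim <- matrix(0, n, n, dimnames = list(labels, labels))
  for (i in seq_len(n)) {
    # Repeat sequence i on every row and compare it with all sequences at once
    ref <- matrix(chars[i, ], n, ncol(chars), byrow = TRUE)
    sim[i, ] <- rowMeans(chars == ref)
  }
  sim * 100
}

seqs <- c("AATT", "AATT", "TATT", "TATA")
result <- positional_similarity(seqs, labels = c("B1", "B2", "B3", "B4"))
print(result)
```

The diagonal comes out as 100 rather than the "x" used in the question, since every sequence matches itself at all positions.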

Or, for the other vector provided in the comments:

sapply(df$sequence,function(x){
  sapply(df$sequence,function(y){
    sum(strsplit(x,"")[[1]]==strsplit(y,"")[[1]])
  })
})/4

     GACC AAAC ACAC GCCA
GACC 1.00 0.50 0.25 0.50
AAAC 0.50 1.00 0.75 0.00
ACAC 0.25 0.75 1.00 0.25
GCCA 0.50 0.00 0.25 1.00
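To see why the Levenshtein-based score and the position-wise score can disagree, here is a small illustration with two made-up strings (they are not from the question's data): the second is the first shifted left by one position.

```r
a <- "ACGT"
b <- "CGTA"   # same letters, shifted left by one position (illustrative input)

# Levenshtein distance: delete the leading "A" and append an "A" -> 2 edits
lev <- utils::adist(a, b)[1, 1]
lev_similarity <- 1 - lev / nchar(a)

# Position-wise comparison: fraction of positions holding the same letter
pos_similarity <- mean(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])

print(c(levenshtein = lev_similarity, positional = pos_similarity))
```

The edit distance rewards the shifted overlap, while the position-wise comparison finds no matching position at all.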

Upvotes: 2
