Hadij

Reputation: 4600

Using n-grams to cluster protein data (R equivalent of ngram.NGram.compare)

I have some sequence data to compare. The expected output is a distance matrix showing how similar each sequence is to the others. Previously I used ngram.NGram.compare in Python, and now I want to switch to R. I found the ngram and biogram packages, but I could not find a function that generates the expected output.

Assume this is the data:

a <- c("ham","bam","comb")

The output should be like this (distance between each item):

#      ham    bam   comb
#ham    0     0.5   0.83
#bam   0.5     0     0.6
#comb  0.83   0.6     0

This Python code produces the same distances (one value per pair of items):

import ngram

a = ["ham", "bam", "comb"]
# Distance = 1 - similarity, computed for each unordered pair (i < j)
[1 - ngram.NGram.compare(a[i], a[j], N=1)
 for i in range(len(a))
 for j in range(i + 1, len(a))]

Upvotes: 0

Views: 183

Answers (1)

phiver

Reputation: 23598

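You could use stringdistmatrix from the stringdist package. See the stringdist-metrics documentation for the available metrics. With the default q = 1, the Jaccard distance is computed on single characters, which corresponds to N=1 in the Python code.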

a <- c("ham","bam","comb")
stringdist::stringdistmatrix(a, a, method = "jaccard")

          [,1] [,2]      [,3]
[1,] 0.0000000  0.5 0.8333333
[2,] 0.5000000  0.0 0.6000000
[3,] 0.8333333  0.6 0.0000000
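
To see where these numbers come from, here is a minimal pure-Python sketch of the Jaccard distance on sets of character q-grams. It is only an illustration, not the actual implementation of stringdist or the ngram package; the helper name jaccard_distance and the choice to return 0 for two empty inputs are assumptions made for this example.

```python
def jaccard_distance(s, t, q=1):
    """Return 1 - |A & B| / |A | B|, where A and B are the sets of q-grams of s and t."""
    grams_s = {s[i:i + q] for i in range(len(s) - q + 1)}
    grams_t = {t[i:i + q] for i in range(len(t) - q + 1)}
    union = grams_s | grams_t
    if not union:
        # Assumption for this sketch: two inputs without any q-grams count as identical.
        return 0.0
    return 1 - len(grams_s & grams_t) / len(union)


seqs = ["ham", "bam", "comb"]
# Full symmetric distance matrix, rounded for display
matrix = [[round(jaccard_distance(s, t), 4) for t in seqs] for s in seqs]
for name, row in zip(seqs, matrix):
    print(name, row)
```

For example, "ham" and "comb" share only "m" out of six distinct characters, so the distance is 1 - 1/6, about 0.8333, which matches the matrix above.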

Upvotes: 1
