Reputation: 4600
I have some sequence data to compare. The expected output is a distance matrix showing how similar each sequence is to the others. Previously, I used ngram.NGram.compare in Python, and now I want to switch to R. I found the ngram and biogram packages, but I was unable to find a function that generates the expected output.
Assume this is the data:
a <- c("ham","bam","comb")
The output should be like this (distance between each item):
#      ham   bam   comb
# ham  0     0.5   0.83
# bam  0.5   0     0.6
# comb 0.83  0.6   0
This is the equivalent Python code that produces these distances:
import ngram

a = ["ham", "bam", "comb"]
# Distance = 1 - similarity, computed for each unordered pair (i < j)
[1 - ngram.NGram.compare(a[i], a[j], N=1)
 for i in range(len(a))
 for j in range(i + 1, len(a))]
Upvotes: 0
Views: 183
Reputation: 23598
You could use stringdistmatrix from the stringdist package. Check the stringdist-metrics documentation to see which metrics are available.
a <- c("ham","bam","comb")
stringdist::stringdistmatrix(a, a, method = "jaccard")
          [,1] [,2]      [,3]
[1,] 0.0000000  0.5 0.8333333
[2,] 0.5000000  0.0 0.6000000
[3,] 0.8333333  0.6 0.0000000
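To see where these numbers come from, here is a rough Python sketch of a Jaccard distance over sets of single characters, which is what the "jaccard" method computes when q-grams have length 1 (the stringdist default, as far as I know). The helper names char_ngrams and jaccard_distance are made up for illustration, and returning 0 for two empty strings is my own assumption. This is not how stringdist or the ngram package is implemented internally.

```python
def char_ngrams(s, q=1):
    """Return the set of character q-grams contained in s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_distance(x, y, q=1):
    """Jaccard distance: 1 - |A & B| / |A | B| over the q-gram sets."""
    a, b = char_ngrams(x, q), char_ngrams(y, q)
    if not a and not b:
        # Assumption: two strings with no q-grams are treated as identical
        return 0.0
    return 1 - len(a & b) / len(a | b)


words = ["ham", "bam", "comb"]

# Full symmetric distance matrix, one row per word
matrix = [[jaccard_distance(x, y) for y in words] for x in words]

for word, row in zip(words, matrix):
    print(word, [round(d, 4) for d in row])
```

For example, "ham" and "bam" share the characters a and m out of the four distinct characters h, a, m, b, so the distance is 1 - 2/4 = 0.5, matching the matrix above.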
Upvotes: 1