How to find the % match/similarity between cells in a table in R?

Question

I have a bunch of sequences in a table (e.g. TCGATCGATCGA) and I want to find those that are 90% matches. I am looking at the RecordLinkage package and its levenshteinSim function. I know I can manually import each of the sequences and compare them, but I have over 1000 sequences, so how would I get it to automatically compare each row to every other row?

Gainz · Accepted Answer

The same function is covered in Mako212's link, although I want to add some explanation, since I use this package from time to time and it can be quite useful. We will use the levenshteinSim() function from the RecordLinkage package.

Package:

install.packages("RecordLinkage")
library(RecordLinkage)

Find those 90% matches:

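# Example sequences to compare against a reference sequence
data <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")
data
# [1] "tcgartyu"    "tcgart"      "tckael"      "tcgatcgatc"  "tcgatcgatcg"

# Similarity of the reference to each element: 1 - distance / length of the longer string
matches <- levenshteinSim("tcgatcgatcga", data)
round(matches, 2)
# [1] 0.42 0.42 0.25 0.83 0.92

# Flag the elements that are at least a 90% match
matches_90 <- matches >= 0.9
matches_90
# [1] FALSE FALSE FALSE FALSE  TRUE

# Extract the matching sequences themselves
data[matches_90]
# [1] "tcgatcgatcg"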

So with this function you can get the rows that match at 90% or greater, as in my example. You can then use those similarity scores however you need.
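To compare every row against every other row instead of against a single reference, one option is to build the full pairwise similarity matrix. The following is a minimal sketch, not part of the original answer: it swaps in base R's adist() (from the utils package) for levenshteinSim() and applies the same normalisation, 1 - distance / length of the longer string. The names seqs, sim and pairs_90 are illustrative, and the sketch assumes no empty strings.

```r
# Toy data from the answer; in practice this would be the sequence column
# of your table, as a character vector.
seqs <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")

# Levenshtein distance between every pair of sequences (n x n matrix).
d <- adist(seqs)

# Normalise to a similarity in [0, 1] using the longer string of each pair.
len <- nchar(seqs)
sim <- 1 - d / outer(len, len, pmax)
dimnames(sim) <- list(seqs, seqs)

# Keep each unordered pair once (upper triangle) and apply the 90% threshold.
hits <- which(sim >= 0.9 & upper.tri(sim), arr.ind = TRUE)
pairs_90 <- data.frame(
  seq1 = seqs[hits[, "row"]],
  seq2 = seqs[hits[, "col"]],
  similarity = sim[hits]
)
pairs_90
```

For about 1000 sequences this produces a 1000 x 1000 matrix, which should be manageable in memory; for much larger tables a blocked or indexed approach would likely be needed.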

Please note that the str1 and str2 arguments of levenshteinSim() need to be character vectors.
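If the sequences live in a data frame column, converting that column explicitly is a cheap way to make sure a character vector is passed. A small sketch, where the data frame df and its column sequence are hypothetical names chosen for illustration:

```r
# Hypothetical table: `df` and `sequence` are illustrative names,
# not taken from the question. The column is deliberately a factor here.
df <- data.frame(
  sequence = c("tcgatcgatc", "tcgatcgatcg"),
  stringsAsFactors = TRUE
)

# as.character() converts the factor levels back into plain strings.
seqs <- as.character(df$sequence)
is.character(seqs)
```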

For more information, see https://cran.r-project.org/package=RecordLinkage .
