Reputation: 15
I have a bunch of sequences in a table (e.g. TCGATCGATCGA) and I want to find those that are 90% matches. I am looking at the RecordLinkage package and its levenshteinSim function. I know I can manually import each of the sequences and compare them, but I have over 1,000 sequences, so how would I get it to automatically compare each row with every other row?
Upvotes: 0
Views: 363
Reputation: 1771
The same function is in Mako212's link, although I want to give some explanation since I use this package sometimes and it can be quite useful. We will use the levenshteinSim() function from the RecordLinkage package.
Package:
install.packages("RecordLinkage")
library(RecordLinkage)
Find those 90% matches:
data <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")
[1] "tcgartyu" "tcgart" "tckael" "tcgatcgatc" "tcgatcgatcg"
matches <- levenshteinSim('tcgatcgatcga', data)
[1] 0.42 0.42 0.25 0.83 0.92
matches_90 <- matches > 0.9
[1] FALSE FALSE FALSE FALSE TRUE
So with this function you can get the rows that match at 90% or more (my example uses a strict > 0.9; use >= 0.9 to include exact 90% matches). You can then use those similarity scores however you want.
Please note that the str1 and str2 arguments of the levenshteinSim() function need to be character vectors.
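To compare every row against every other row rather than against a single reference, one option is to build the full similarity matrix. The sketch below is not RecordLinkage code: it swaps in base R's adist() for the Levenshtein distance and assumes the similarity is 1 - distance / (length of the longer string), which is consistent with the 0.92 in the example above (1 - 1/12). The data and the names seqs, sim and pairs are made up for illustration.

```r
# Hypothetical example data; in practice take the sequence column of your
# table, e.g. seqs <- as.character(my_table$sequence)
seqs <- c("tcgatcgatcga", "tcgatcgatcg", "tcgatcgatc", "tcgart", "tckael")

# All pairwise Levenshtein distances (adist() ships with base R, in utils)
d <- adist(seqs)

# Assumed normalisation: divide by the longer string of each pair,
# giving a similarity between 0 and 1
len <- nchar(seqs)
sim <- 1 - d / outer(len, len, pmax)

# Keep each unordered pair once (upper triangle) at or above the threshold
hits <- which(sim >= 0.9 & upper.tri(sim), arr.ind = TRUE)
pairs <- data.frame(
  seq1 = seqs[hits[, "row"]],
  seq2 = seqs[hits[, "col"]],
  similarity = sim[hits]
)
print(pairs)
```

For about 1000 sequences the matrix has roughly a million entries, which should still fit comfortably in memory. If you prefer to stay with RecordLinkage, the same matrix idea should carry over by calling levenshteinSim() on each pair.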
For more information, see https://cran.r-project.org/package=RecordLinkage .
Upvotes: 1
Reputation: 171
I would recommend you look at the stringdist package. Specifically, its stringdist() function gives you a numeric output that measures how far one string is from another. You should be able to play around with thresholds to suit your purposes.
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
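As a rough sketch of what that could look like, assuming the stringdist package is installed and that plain Levenshtein distance (method = "lv") is the metric you want. The example sequences and the 0.9 threshold are illustrative only:

```r
library(stringdist)  # install.packages("stringdist") if it is missing

# Made-up sequences for illustration
seqs <- c("tcgatcgatcga", "tcgatcgatcg", "tcgatcgatc", "tcgart", "tckael")

# Edit distance between two strings
d_one <- stringdist("tcgatcgatcga", "tcgatcgatcg", method = "lv")

# All pairwise distances in one call
dist_mat <- stringdistmatrix(seqs, seqs, method = "lv")

# stringsim() rescales the distance to a similarity between 0 and 1,
# so a threshold such as 0.9 can be applied directly
sim_to_ref <- stringsim("tcgatcgatcga", seqs, method = "lv")
close_matches <- seqs[sim_to_ref >= 0.9]
print(close_matches)
```

Other values of method (for example "osa" or "hamming") may suit DNA-like data better depending on what kinds of differences you want to count.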
Best, Mostafa
Upvotes: 0