tamcle

Reputation: 15

How to find the % match/similarity between cells in a table in R?

I have a table of sequences (e.g. TCGATCGATCGA) and I want to find the ones that are at least 90% similar to each other. I am looking at the RecordLinkage package and its levenshteinSim function. I know I can pass in the sequences manually and compare them one pair at a time, but I have over 1,000 sequences. How can I automatically compare every row against every other row?

Upvotes: 0

Views: 363

Answers (2)

Gainz

Reputation: 1771

The same function appears in Mako212's link, but I want to add some explanation since I use this package from time to time and find it quite useful. We will use the levenshteinSim() function from the RecordLinkage package.

Package:

install.packages("RecordLinkage")
library(RecordLinkage)

Find those 90% matches:

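data <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")
data
# [1] "tcgartyu"    "tcgart"      "tckael"      "tcgatcgatc"  "tcgatcgatcg"

# Similarity of each element of data to the reference sequence (0 to 1)
matches <- levenshteinSim('tcgatcgatcga', data)
matches
# [1] 0.42 0.42 0.25 0.83 0.92   (rounded to two decimals)

# TRUE where the similarity is 90% or greater
matches_90 <- matches >= 0.9
matches_90
# [1] FALSE FALSE FALSE FALSE  TRUE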

With this function you get a similarity score between 0 and 1 for each row, and the logical vector tells you which rows match at 90% or more. You can then use it to subset your data, for example data[matches_90].

Please note that the str1 and str2 arguments of levenshteinSim() need to be character vectors.

For more information, see https://cran.r-project.org/package=RecordLinkage .
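
To compare every row against every other row, as the question asks, one option is to repeat the call above once per sequence and collect the results in a matrix. This is only a minimal sketch: the seqs vector is made-up example data standing in for your table column, and the variable names are my own.

```r
library(RecordLinkage)

# Hypothetical example data; replace with your own column of sequences
seqs <- c("tcgatcgatcga", "tcgatcgatcg", "tcgatcgatc", "tckael")

# Compare each sequence against the whole vector: gives an n x n similarity matrix
sim <- sapply(seqs, function(s) levenshteinSim(s, seqs))

# Keep each pair only once (upper triangle), which also drops self-comparisons
hits <- which(sim >= 0.9 & upper.tri(sim), arr.ind = TRUE)

# One row per matching pair, together with its similarity score
result <- data.frame(
  seq1 = seqs[hits[, "row"]],
  seq2 = seqs[hits[, "col"]],
  similarity = sim[hits]
)
print(result)
```

For 1,000 sequences this is about a million comparisons, which should still be manageable, but the matrix grows quadratically with the number of rows.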

Upvotes: 1

Mostafa Lotfi

Reputation: 171

I would recommend looking at the stringdist package, specifically its stringdist() function, which returns a numeric distance telling you how far one string is from another. You should be able to experiment with thresholds to suit your purposes.

https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
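
As a rough sketch of how that could look for all pairs at once, the same package has stringdistmatrix(), which returns every pairwise distance in one call. The example data, the choice of the Levenshtein method ("lv"), and the normalisation by the longer string's length are my assumptions here, not something the package requires.

```r
library(stringdist)

# Hypothetical example data; replace with your own column of sequences
seqs <- c("tcgatcgatcga", "tcgatcgatcg", "tcgatcgatc", "tckael")

# Full n x n matrix of pairwise Levenshtein distances
d <- stringdistmatrix(seqs, seqs, method = "lv")

# Turn distances into similarities in [0, 1] by dividing by the longer length
len <- nchar(seqs)
sim <- 1 - d / outer(len, len, pmax)

# Row/column indices of pairs that are at least 90% similar, each pair once
pairs <- which(sim >= 0.9 & upper.tri(sim), arr.ind = TRUE)
print(pairs)
```

Each row of pairs holds the positions in seqs of two sequences that pass the threshold.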

Best, Mostafa

Upvotes: 0
