ksanb

Reputation: 1

partial string matching in R? Is this possible?

I'm not actually sure if this is possible. I have these two data frames that have scientific names. Some of them are misspelled, some have missing spaces, others are homonyms (not the same species), and others match. So I have something like this:

stringDF <- data.frame(string = c("Abietinella abietina (Hedw.) M.Fleisch.", "Abietinella abietina (Hedw.) M. Fleisch.", "Abietinella abietina (Hedw.) Smith", "Abitinella abietina (Hedw.) M. Fleisch."))
patternDF <- data.frame(string = "Abietinella abietina (Hedw.) M. Fleisch.", match = "A")

patternDF has the "correct name" plus a column (that I'm calling "match") containing important information. I'm trying to make a "match" column in stringDF where "A" is pasted when the name matches partially. So ideally, I'd like something like this:

string                                      match
Abietinella abietina (Hedw.) M.Fleisch.     A
Abietinella abietina (Hedw.) M. Fleisch.    A
Abietinella abietina (Hedw.) Smith          NA
Abitinella abietina (Hedw.) M. Fleisch.     A

I've tried using this function:

stringDF$match <- patternDF$match[pmatch(stringDF$string, patternDF$string)]

but I'm not having any luck. Is this possible to do in R? I've also tried using the %like% function from the data.table package.

I'm not the best at coding, so sorry in advance for my ignorance! Thanks y'all!

Upvotes: 0

Views: 102

Answers (1)

ctwheels

Reputation: 22817

You can use the stringdist package (available on CRAN) to accomplish this without resorting to a hacky regex workaround. Fuzzy regex matching is available in some packages and other languages (like the regex package on PyPI for Python - see Approximate "fuzzy" matching).

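In any case, it's likely better to use a Levenshtein distance function for your case (search for the term if you want more background). The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other.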

library(stringdist)

# stringdist() defaults to method = "osa"; request Levenshtein explicitly
stringdist(
  "Abietinella abietina (Hedw.) M. Fleisch.",
  c("Abietinella abietina (Hedw.) M.Fleisch.",
    "Abietinella abietina (Hedw.) M. Fleisch.",
    "Abietinella abietina (Hedw.) Smith",
    "Abitinella abietina (Hedw.) M. Fleisch."),
  method = "lv"
)

Running the code above yields the following:

[1] 1 0 9 1

Those are the Levenshtein distances for each of the 4 strings respectively. You can use the result with some coding logic to only accept those with low enough Levenshtein values. Based on your current strings, I might suggest only keeping strings with values <=4, but you can tweak it as needed.
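
As a rough sketch of that "coding logic", assuming the stringDF and patternDF from the question and the cutoff of 4 suggested above, stringdist's amatch() can do the lookup and the thresholding in one step. The max_dist name is just an illustrative variable, and the cutoff is something to tune against your own data:

```r
library(stringdist)

# Data frames from the question (stringsAsFactors = FALSE keeps plain strings
# on R versions older than 4.0)
stringDF <- data.frame(string = c(
  "Abietinella abietina (Hedw.) M.Fleisch.",
  "Abietinella abietina (Hedw.) M. Fleisch.",
  "Abietinella abietina (Hedw.) Smith",
  "Abitinella abietina (Hedw.) M. Fleisch."
), stringsAsFactors = FALSE)
patternDF <- data.frame(
  string = "Abietinella abietina (Hedw.) M. Fleisch.",
  match = "A",
  stringsAsFactors = FALSE
)

# Assumed cutoff: accept a name only if it is within 4 edits of a pattern
max_dist <- 4

# amatch() returns, for each name, the row index of the closest entry in
# patternDF$string within max_dist, or NA if nothing is close enough
idx <- amatch(stringDF$string, patternDF$string, method = "lv", maxDist = max_dist)

# Indexing with NA yields NA, so unmatched names get NA in the new column
stringDF$match <- patternDF$match[idx]
print(stringDF)
```

With more than one row in patternDF, the same call picks the closest pattern for each name, so a stricter cutoff may be needed to keep homonyms from being matched to the wrong entry.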

Upvotes: 1
