Erich

Reputation: 949

Finding the best string match with R

Starting with this string: "L Hernandez"

From a vector containing the following:

[1] "HernandezOlaf "    "HernandezLuciano " "HernandezAdrian "

I tried this:

subset(ABC, str_detect(ABC, "L Hernandez") == TRUE)

I want the entry that contains the name Hernandez together with a capital L anywhere in the string.

The desired output is HernandezLuciano

Upvotes: 4

Views: 1772

Answers (3)

lawyeR

Reputation: 7654

You could modify the following if you only want full names after a capital L:

vec1[grepl("Hernandez", vec1) & grepl("L\\.*", vec1)]
[1] "L Hernandez"       "HernandezLuciano "

or

vec1[grepl("Hernandez", vec1) & grepl("L[[:alpha:]]", vec1)]
[1] "HernandezLuciano "

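The expression looks for a match on "Hernandez" and then checks for a capital "L". In the first version, "L\\.*" means "L" followed by zero or more literal periods, so any capital "L" qualifies, including the one in "L Hernandez". The second version requires a letter immediately after the capital "L".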

BTW, adding a backslash before the character class breaks the pattern, because "\\[" is then treated as a literal "[" and nothing matches:

vec1[grepl("Hernandez", vec1) & grepl("L\\[[:alpha:]]", vec1)]
character(0)
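As a sketch of an alternative, the two conditions can be folded into a single regular expression using alternation, so only one grepl call is needed. The vector vec1 below is an assumption (it mirrors the one used in the other answers), and the variable name combined is made up for illustration.

```r
# Assumed input, mirroring the vector used elsewhere in this thread
vec1 <- c("L Hernandez", "HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")

# "Hernandez" followed somewhere by a capital L plus a letter,
# or a capital L plus a letter followed somewhere by "Hernandez"
combined <- "Hernandez.*L[[:alpha:]]|L[[:alpha:]].*Hernandez"

result <- vec1[grepl(combined, vec1)]
print(result)
```

Unlike the two-grepl version, this pattern also constrains the order of the two pieces within each alternative; whether that matters depends on the data.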

Upvotes: 0

bartektartanus

Reputation: 16080

You could use the agrep function for approximate string matching. If you simply run it with the default settings, it matches every string:

agrep("L Hernandez", c("HernandezOlaf ",    "HernandezLuciano ", "HernandezAdrian "))
[1] 1 2 3

Reordering the pattern from "L Hernandez" to "Hernandez L" does not change that on its own:

agrep("Hernandez L", c("HernandezOlaf ",    "HernandezLuciano ", "HernandezAdrian "))
[1] 1 2 3

but if you also lower the max distance (the max.distance argument)

agrep("Hernandez L", c("HernandezOlaf ",    "HernandezLuciano ", "HernandezAdrian "), max.distance = 0.01)
[1] 2

you get the right answer. This is only an idea, it might work for you :)
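A minimal sketch of the same call that returns the matching strings instead of their positions, using agrep's value = TRUE argument. The variable names candidates and hits are made up for illustration. As far as I understand the documentation, a max.distance below 1 is treated as a fraction of the pattern length and rounded up, so 0.01 here should allow a single edit (dropping the space between "Hernandez" and "L").

```r
# Same three candidates as in the question
candidates <- c("HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")

# value = TRUE returns the matched elements rather than their indices
hits <- agrep("Hernandez L", candidates, max.distance = 0.01, value = TRUE)
print(hits)
```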

Upvotes: 0

akrun

Reputation: 887038

Maybe this helps:

vec1 <- c("L Hernandez", "HernandezOlaf ","HernandezLuciano ", "HernandezAdrian ")
grep("L ?Hernandez|Hernandez ?L", vec1, value = TRUE)
#[1] "L Hernandez" "HernandezLuciano "

Update

variable <- "L Hernandez"

v1 <- gsub(" ", " ?", variable) #replace space with a space and question mark 
v2 <- gsub("([[:alpha:]]+) ([[:alpha:]]+)", "\\2 ?\\1", variable) #reverse the order of words in the string and add question mark

You can also use strsplit to split variable as @rawr commented

grep(paste(v1, v2, sep = "|"), vec1, value = TRUE)
#[1] "L Hernandez"       "HernandezLuciano "
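The strsplit variant mentioned above could look roughly like this. It is only a sketch: build_pattern is a hypothetical helper name, and it assumes the query consists of exactly two words separated by a single space and contains no regex metacharacters.

```r
# Hypothetical helper: turn "A B" into the pattern "A ?B|B ?A"
build_pattern <- function(query) {
  words <- strsplit(query, " ", fixed = TRUE)[[1]]
  stopifnot(length(words) == 2)                    # sketch handles two words only
  forward  <- paste(words, collapse = " ?")        # "L ?Hernandez"
  backward <- paste(rev(words), collapse = " ?")   # "Hernandez ?L"
  paste(forward, backward, sep = "|")
}

vec1 <- c("L Hernandez", "HernandezOlaf ", "HernandezLuciano ", "HernandezAdrian ")
pattern <- build_pattern("L Hernandez")
matches <- grep(pattern, vec1, value = TRUE)
print(matches)
```

If the names may contain characters such as "." or "(", they would need to be escaped before being pasted into the pattern.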

Upvotes: 2
