P. Denelle

Reputation: 830

Finding the closest character string in a second data frame in R

I have a fairly large data.frame with outdated names, and I want to replace them with the correct names stored in another data.frame. I am using the stringdist function to find the closest match between the two name columns, and then I want to put the new names in the original data.frame.

I am using code based on the sapply function, as in the following example:

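# Names that need updating
dat1 <- data.frame(name = paste0("abc", 1:5),
                   value = round(rnorm(5), 1))

# Reference table with the correct names
# (11:15 rather than seq(11:15), which would return 1:5)
dat2 <- data.frame(name = paste0("abd", 1:5),
                   other_info = 11:15)

# For each name in dat1, pick the closest name in dat2
dat1$name2 <- sapply(dat1$name, function(x) {
  char_min <- stringdist::stringdist(x, dat2$name)
  dat2[which.min(char_min), "name"]
})
dat1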

However, this code is too slow for a data.frame of my size.

Is there a faster alternative, for example using the data.table package?

Upvotes: 2

Views: 2328

Answers (1)

Jiafei Li

Reputation: 53

First convert the data frames into data tables:

library(data.table)

dat1 <- as.data.table(dat1)
dat2 <- as.data.table(dat2)

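Then use := together with stringdist::amatch to create a new column holding the closest name from dat2. Note that amatch only accepts matches within maxDist (0.1 by default) and returns NA otherwise, so set maxDist = Inf to always get the nearest name, as which.min does in the question:

dat1[, name2 := dat2$name[stringdist::amatch(name, dat2$name, maxDist = Inf)]]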
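As a quick sanity check, here is a minimal self-contained sketch comparing the amatch approach with the which.min loop from the question. The tables old and ref and their contents are made up for illustration, and maxDist = Inf is passed so that amatch accepts a match at any distance:

```r
library(data.table)
library(stringdist)

# Toy data (hypothetical names, not the asker's real data)
old <- data.table(name = c("abc1", "abc2", "abc3"))
ref <- data.table(name = c("abd1", "abd2", "abd3"))

# Baseline from the question: one stringdist() call per name
slow <- vapply(old$name, function(x) {
  ref$name[which.min(stringdist(x, ref$name))]
}, character(1), USE.NAMES = FALSE)

# amatch() returns, for each name, the index of the closest entry in ref$name;
# maxDist = Inf keeps matches at any distance instead of returning NA
old[, name2 := ref$name[amatch(name, ref$name, maxDist = Inf)]]

# Both approaches should pick the same reference names on this toy data
identical(old$name2, slow)
```

Both functions use the "osa" distance by default, so they rank candidates the same way here; pass the same method argument to both if you need a different metric.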

This should be much faster than the sapply loop, because amatch handles all names in a single vectorized call instead of one R-level call per name. Hope this helps!
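If you want to check the speed difference on your own machine, a rough timing sketch could look like the following. The helper rand_names, the random 8-letter names, and the size of 2000 are arbitrary assumptions for illustration; actual timings will depend on your data and hardware:

```r
library(stringdist)

set.seed(1)

# Hypothetical helper: n random 8-letter strings
rand_names <- function(n) {
  replicate(n, paste(sample(letters, 8, replace = TRUE), collapse = ""))
}
old_names <- rand_names(2000)
ref_names <- rand_names(2000)

# Loop version from the question: one stringdist() call per name
t_loop <- system.time(
  loop <- vapply(old_names, function(x) {
    ref_names[which.min(stringdist(x, ref_names))]
  }, character(1), USE.NAMES = FALSE)
)

# Vectorized version: a single amatch() call for all names
t_amatch <- system.time(
  vec <- ref_names[amatch(old_names, ref_names, maxDist = Inf)]
)

# Elapsed seconds for each approach
c(loop = t_loop[["elapsed"]], amatch = t_amatch[["elapsed"]])
```

For a more careful comparison, repeat the measurement several times or use a dedicated benchmarking package.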

Upvotes: 1
