R: Speed up Code for Levenschtein match

Question

I have two data tables which contain city names. The first mydf contains a list of cities we want to check. It consists of 18990 records. The second is a reference table i have been given which contains 353766 rows.

The structure of both tables is below (head 10)

structure(list(country = "LT", city = "VILNIUS", cleaned_city = "VILNIUS"), .Names = c("country", 
"city", "cleaned_city"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L))

Table myref can contain cities that don't exist in mydf as its our reference table. Table mydf can contain cities that don't exits in myref as we are trying to determine what is missing from our reference table. Table mydf may contain cities that are spelt subtly different to what we have in myref.

The cities exist from many countries so using for loops I implemented the following logic:

loop through every country in mydf
loop through every city in mydf for the country in loop 1 and implement a levenschtein match between the same country in the reference table (different countries can have similiar towns so this is the reason for the first loop)
record the %match and the most similar town in the mydf table for manual examination after the code has run.

I have tried creating a second column in the mydf table with every possible match to a city within that country and then run the levenschtein match but i ran out of memory (the table is too large and im on 32 bit windows laptop) which is why i went back to the for loops which is taking days to run. Can anyone help. My code is below (i know for loops are probably the least optimal way to approach this so this should be a great learning experience) and if any more information is needed please let me know

# Initialize variables, my.country contains all the unique countries in     mydf
my.country <- unique(mydf$country)
mydf.sample <- mydf[0, ]
myref.sample <- myref[0, ]
mydf.final <- mydf[0, ] %>%
  mutate(levdist = 0,
      town.match = '')

# For each country, Take each item in mydf, compare it to every record     in the reference table myref
# get the best levenschtein match and score
# add the levenschtein score and the city matched to mydf

for(intcountry in 1:length(my.country))
  {

  # Filter the mydf Table & myRef Table to specific countries based on the intcountry iteration
  mydf.sample <- mydf %>% 
    filter(country == my.country[intcountry])

  myref.sample <- mydf %>% 
    filter(country == my.country[intcountry])

  # Inititalize the temp vector to the size of the mydf size
    vector <- character(length(mydf.sample))

# Set Up Levenschtein Distance
# For every Record in the Dataframe to be checked
for(item in 1:nrow(mydf.sample))
{
  # For every Record in the Reference Table
  for(k in 1:nrow(myref))
  {
    vector[k] = levenshteinSim(mydf.sample$City[item],myref$city[k])
  }
  # Get index of the highest levenschtein match
  max.match.index = match(max(vector),vector)
  mydf.sample$levdist[item] = max(vector)
  mydf.sample$town.match[item] = myref$city[max.match.index]
  vector <- character(length(mydf.sample))
}
mydf.final<- rbind(mydf.sample, mydf.final)
 }

R: Speed up Code for Levenschtein match

Answers (1)

Related Questions