John Smith
John Smith

Reputation: 2876

R: Speed up Code for Levenschtein match

I have two data tables which contain city names. The first mydf contains a list of cities we want to check. It consists of 18990 records. The second is a reference table i have been given which contains 353766 rows.

The structure of both tables is below (head 10)

structure(list(country = "LT", city = "VILNIUS", cleaned_city = "VILNIUS"), .Names = c("country", 
"city", "cleaned_city"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L))

Table myref can contain cities that don't exist in mydf as its our reference table. Table mydf can contain cities that don't exits in myref as we are trying to determine what is missing from our reference table. Table mydf may contain cities that are spelt subtly different to what we have in myref.

The cities exist from many countries so using for loops I implemented the following logic:

  1. loop through every country in mydf
  2. loop through every city in mydf for the country in loop 1 and implement a levenschtein match between the same country in the reference table (different countries can have similiar towns so this is the reason for the first loop)
  3. record the %match and the most similar town in the mydf table for manual examination after the code has run.

I have tried creating a second column in the mydf table with every possible match to a city within that country and then run the levenschtein match but i ran out of memory (the table is too large and im on 32 bit windows laptop) which is why i went back to the for loops which is taking days to run. Can anyone help. My code is below (i know for loops are probably the least optimal way to approach this so this should be a great learning experience) and if any more information is needed please let me know

# Initialize variables, my.country contains all the unique countries in     mydf
my.country <- unique(mydf$country)
mydf.sample <- mydf[0, ]
myref.sample <- myref[0, ]
mydf.final <- mydf[0, ] %>%
  mutate(levdist = 0,
      town.match = '')

# For each country, Take each item in mydf, compare it to every record     in the reference table myref
# get the best levenschtein match and score
# add the levenschtein score and the city matched to mydf

for(intcountry in 1:length(my.country))
  {

  # Filter the mydf Table & myRef Table to specific countries based on the intcountry iteration
  mydf.sample <- mydf %>% 
    filter(country == my.country[intcountry])

  myref.sample <- mydf %>% 
    filter(country == my.country[intcountry])

  # Inititalize the temp vector to the size of the mydf size
    vector <- character(length(mydf.sample))

# Set Up Levenschtein Distance
# For every Record in the Dataframe to be checked
for(item in 1:nrow(mydf.sample))
{
  # For every Record in the Reference Table
  for(k in 1:nrow(myref))
  {
    vector[k] = levenshteinSim(mydf.sample$City[item],myref$city[k])
  }
  # Get index of the highest levenschtein match
  max.match.index = match(max(vector),vector)
  mydf.sample$levdist[item] = max(vector)
  mydf.sample$town.match[item] = myref$city[max.match.index]
  vector <- character(length(mydf.sample))
}
mydf.final<- rbind(mydf.sample, mydf.final)
 }

Upvotes: 2

Views: 160

Answers (1)

Jaap
Jaap

Reputation: 83215

When speed and memory issues are important, then the data.table package is often a good alternative. As you didn't provide example data which illustrate the problem, I created some (see the end of this answer for the dputs of the used data).

1: First you have to transform your dataframes into datatables:

library(data.table)
setDT(mydf, key=c("country","city"))
setDT(myref, key=c("country","city"))

With the key=c("country","city") part you also create a reference for each datatable.

2: Now you can easily remove the entries in mydf which are also in the reference datset myref with:

mydf <- mydf[!myref]

which gives:

> mydf
   country   city
1:      GB LONDON
2:      LT KAUNAS
3:      LT VILNUS

As you can see, the record (row) for "VILNIUS" is removed from mydf but the record/row for "VILNUS" is not because it is not an exact match.

EDIT: I removed the 3rd and 4th option as they seem not to be working properly.


Used data:

mydf <- structure(list(country = c("LT","GB","LT","LT"), city = c("VILNIUS","LONDON","KAUNAS","VILNUS")), .Names = c("country", "city"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
myref <- structure(list(country = c("LT","NL"), city = c("VILNIUS","AMSTERDAM")), .Names = c("country", "city"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

Upvotes: 1

Related Questions