Reputation: 2876
I have two data tables which contain city names. The first mydf
contains a list of cities we want to check. It consists of 18990 records. The second is a reference table i have been given which contains 353766 rows.
The structure of both tables is below (head 10)
structure(list(country = "LT", city = "VILNIUS", cleaned_city = "VILNIUS"), .Names = c("country",
"city", "cleaned_city"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L))
Table myref can contain cities that don't exist in mydf
as its our reference table. Table mydf
can contain cities that don't exits in myref
as we are trying to determine what is missing from our reference table.
Table mydf
may contain cities that are spelt subtly different to what we have in myref
.
The cities exist from many countries so using for loops I implemented the following logic:
mydf
mydf
for the country in loop 1 and implement a levenschtein match between the same country in the reference table (different countries can have similiar towns so this is the reason for the first loop)mydf
table for manual examination after the code has run.I have tried creating a second column in the mydf table with every possible match to a city within that country and then run the levenschtein match but i ran out of memory (the table is too large and im on 32 bit windows laptop) which is why i went back to the for loops which is taking days to run. Can anyone help. My code is below (i know for loops are probably the least optimal way to approach this so this should be a great learning experience) and if any more information is needed please let me know
# Initialize variables, my.country contains all the unique countries in mydf
my.country <- unique(mydf$country)
mydf.sample <- mydf[0, ]
myref.sample <- myref[0, ]
mydf.final <- mydf[0, ] %>%
mutate(levdist = 0,
town.match = '')
# For each country, Take each item in mydf, compare it to every record in the reference table myref
# get the best levenschtein match and score
# add the levenschtein score and the city matched to mydf
for(intcountry in 1:length(my.country))
{
# Filter the mydf Table & myRef Table to specific countries based on the intcountry iteration
mydf.sample <- mydf %>%
filter(country == my.country[intcountry])
myref.sample <- mydf %>%
filter(country == my.country[intcountry])
# Inititalize the temp vector to the size of the mydf size
vector <- character(length(mydf.sample))
# Set Up Levenschtein Distance
# For every Record in the Dataframe to be checked
for(item in 1:nrow(mydf.sample))
{
# For every Record in the Reference Table
for(k in 1:nrow(myref))
{
vector[k] = levenshteinSim(mydf.sample$City[item],myref$city[k])
}
# Get index of the highest levenschtein match
max.match.index = match(max(vector),vector)
mydf.sample$levdist[item] = max(vector)
mydf.sample$town.match[item] = myref$city[max.match.index]
vector <- character(length(mydf.sample))
}
mydf.final<- rbind(mydf.sample, mydf.final)
}
Upvotes: 2
Views: 160
Reputation: 83215
When speed and memory issues are important, then the data.table
package is often a good alternative. As you didn't provide example data which illustrate the problem, I created some (see the end of this answer for the dput
s of the used data).
1: First you have to transform your dataframes into datatables:
library(data.table)
setDT(mydf, key=c("country","city"))
setDT(myref, key=c("country","city"))
With the key=c("country","city")
part you also create a reference for each datatable.
2: Now you can easily remove the entries in mydf
which are also in the reference datset myref
with:
mydf <- mydf[!myref]
which gives:
> mydf
country city
1: GB LONDON
2: LT KAUNAS
3: LT VILNUS
As you can see, the record (row) for "VILNIUS" is removed from mydf
but the record/row for "VILNUS" is not because it is not an exact match.
EDIT: I removed the 3rd and 4th option as they seem not to be working properly.
Used data:
mydf <- structure(list(country = c("LT","GB","LT","LT"), city = c("VILNIUS","LONDON","KAUNAS","VILNUS")), .Names = c("country", "city"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
myref <- structure(list(country = c("LT","NL"), city = c("VILNIUS","AMSTERDAM")), .Names = c("country", "city"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
Upvotes: 1