Dieu94

Reputation: 455

More efficient way to fuzzy match in R?

I am currently working with a data frame of 2 million lines (records) and want to identify potential duplicate records for follow-up. Someone else wrote a long script for me that works, but it currently takes overnight to run.

It uses the stringdist package. From what I understand, stringdist compares one row against every other row in the data frame, so a data frame with 5 rows requires 20 comparisons:

i.e.
row 1 compared to row 2
row 1 compared to row 3
row 1 compared to row 4
row 1 compared to row 5
row 2 compared to row 1
row 2 compared to row 3
row 2 compared to row 4
row 2 compared to row 5
row 3 compared to row 1
row 3 compared to row 2
row 3 compared to row 4
row 3 compared to row 5
row 4 compared to row 1
row 4 compared to row 2
row 4 compared to row 3
row 4 compared to row 5
row 5 compared to row 1
row 5 compared to row 2
row 5 compared to row 3
row 5 compared to row 4

Increasing the size of the data frame increases the run time quadratically (n rows means n × (n − 1) comparisons). With my rather large data frame, it obviously takes a while.

My proposed solution is this: once a row has been compared against all the other rows, is there a way to omit it from future computations? For example, in the example above, row 1 compared to row 2 gives the same distance as row 2 compared to row 1, so one of those two calculations could be removed.

So, using the example data frame above, the only computations should be:

row 1 compared to row 2
row 1 compared to row 3
row 1 compared to row 4
row 1 compared to row 5
row 2 compared to row 3
row 2 compared to row 4
row 2 compared to row 5
row 3 compared to row 4
row 3 compared to row 5
row 4 compared to row 5
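The reduced set of comparisons above is exactly the lower triangle of the pairwise distance matrix, and the stringdist package can produce just that: called with a single vector, stringdistmatrix() returns a "dist" object holding each pair once. A minimal sketch (the surname values and the cutoff of 1 are made-up stand-ins):

```r
library(stringdist)

# Toy surnames standing in for one column of the real data (hypothetical values)
surnames <- c("Smith", "Smyth", "Jones", "Jonas")

# With a single input vector, stringdistmatrix() returns a "dist" object that
# stores only the lower triangle -- each pair is computed exactly once,
# which is precisely the reduced set of comparisons listed above.
d <- stringdistmatrix(surnames, method = "lv")

# Flag pairs within Levenshtein distance 1 as duplicate candidates
m <- as.matrix(d)
candidates <- which(m <= 1 & upper.tri(m), arr.ind = TRUE)
candidates
```

Here candidates pairs up rows 1/2 ("Smith"/"Smyth") and 3/4 ("Jones"/"Jonas"). Note that materializing the full matrix this way is only feasible for modest n; for millions of rows the matrix itself becomes the bottleneck.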

This is the section of the function that looks for these duplicates across various columns - any ideas on how I can amend it?

lastName      <- stringdist(DataND$SURNAME[rownumber], DataND$SURNAME, method = 'lv')
firstName     <- stringdist(DataND$GIVEN.NAME[rownumber], DataND$GIVEN.NAME, method = 'lv')
birthDate     <- stringdist(DataND$DOB[rownumber], DataND$DOB, method = 'lv')
streetAddress <- stringdist(DataND$ADDR.1[rownumber], DataND$ADDR.1, method = 'lv')
suburb        <- stringdist(DataND$LOCALITY[rownumber], DataND$LOCALITY, method = 'lv')

Upvotes: 3

Views: 926

Answers (2)

Humpelstielzchen

Reputation: 6441

H 1's idea is great. Another option would be the fuzzyjoin package.

library(fuzzyjoin)
library(dplyr)

df <- tibble(id = seq(1,10),
             words = replicate(10, paste(sample(LETTERS, 5), collapse = "")))



stringdist_left_join(df, df, by = c(words = "words"), max_dist = 5, method = "lv", distance_col = "distance") %>%
  filter(distance != 0)

# A tibble: 90 x 5
    id.x words.x  id.y words.y distance
   <int> <chr>   <int> <chr>      <dbl>
 1     1 JUQYR       2 HQMFD          5
 2     1 JUQYR       3 WHQOM          4
 3     1 JUQYR       4 OUWJV          4
 4     1 JUQYR       5 JURGD          3
 5     1 JUQYR       6 ZMLAQ          5
 6     1 JUQYR       7 RWLVU          5
 7     1 JUQYR       8 AYNLE          5
 8     1 JUQYR       9 AUPVJ          4
 9     1 JUQYR      10 JDFEY          4
10     2 HQMFD       1 JUQYR          5
# ... with 80 more rows

With this you have everything set up at the end and can keep or dismiss rows by distance. It took 11 seconds for 100,000 records. Trying stringdistmatrix() instead, however, I got the error:

Error: cannot allocate vector of size 37.3 Gb

(which adds up: even the lower-triangle dist object holds n(n − 1)/2 ≈ 5 billion distances for n = 100,000, about 37 GiB at 8 bytes each).
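One possible refinement of the fuzzyjoin call above: the self-join still returns both orderings of every pair (the sample output shows both 1/2 and 2/1). If each candidate pair is only needed once, filtering on id.x < id.y halves the output and also removes the zero-distance self matches, making the filter(distance != 0) step unnecessary:

```r
library(fuzzyjoin)
library(dplyr)

# Same toy data as above: 10 random 5-letter "words"
df <- tibble(id = seq(1, 10),
             words = replicate(10, paste(sample(LETTERS, 5), collapse = "")))

# id.x < id.y keeps each pair in one orientation only and drops self matches
dedup <- stringdist_left_join(df, df, by = c(words = "words"), max_dist = 5,
                              method = "lv", distance_col = "distance") %>%
  filter(id.x < id.y)
```

With max_dist = 5 every pair of 5-letter strings matches, so this yields the 45 unique pairs instead of 90 ordered ones.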

Upvotes: 2

demarsylvain

Reputation: 2185

lastName<-stringdist(DataND$SURNAME[rownumber], DataND$SURNAME, method='lv')

If I understand this line correctly, it compares one surname (the one at rownumber) with all surnames. So every time rownumber changes, all the comparisons are made again, even the ones already done previously.

To prevent this, try:

lastName <- stringdist(DataND$SURNAME[rownumber], DataND$SURNAME[rownumber:nrows], method = 'lv')

where nrows is the number of rows in the data frame.
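Sketched out in a full loop, the idea looks like this (DataND, the cutoff of 1, and the matches list are made-up stand-ins; note that the slice still includes the row itself, and that positions in the result have to be shifted back to original row numbers):

```r
library(stringdist)

# Hypothetical wrapper showing the suggestion in context: each row is compared
# only with itself and the rows after it, so no pair is computed twice.
DataND <- data.frame(SURNAME = c("Smith", "Smyth", "Jones", "Smith"),
                     stringsAsFactors = FALSE)
nrows   <- nrow(DataND)
matches <- list()

for (rownumber in seq_len(nrows - 1)) {
  lastName <- stringdist(DataND$SURNAME[rownumber],
                         DataND$SURNAME[rownumber:nrows], method = 'lv')
  # lastName[k] is the distance to row (rownumber + k - 1), so hits must be
  # shifted back to original row numbers before use
  hits <- which(lastName <= 1) + rownumber - 1
  hits <- hits[hits != rownumber]          # drop the self comparison
  if (length(hits) > 0) matches[[as.character(rownumber)]] <- hits
}

matches
```

With this toy data, row 1 ("Smith") flags rows 2 and 4, and row 2 ("Smyth") flags row 4; each pair is reported exactly once, from its lower-numbered side.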

Upvotes: 0
