Reputation: 455
I am currently working on a data frame with 2 million rows (records). I want to identify potential duplicate records for follow-up. Someone else has written long code for me that works, but it currently takes overnight to run.
It uses the stringdist package. From what I understand, stringdist works by comparing one row against all other rows in the data frame. So, a data frame with 5 rows would require 20 computations:
i.e.
row 1 compared to row 2
row 1 compared to row 3
row 1 compared to row 4
row 1 compared to row 5
row 2 compared to row 1
row 2 compared to row 3
row 2 compared to row 4
row 2 compared to row 5
row 3 compared to row 1
row 3 compared to row 2
row 3 compared to row 4
row 3 compared to row 5
row 4 compared to row 1
row 4 compared to row 2
row 4 compared to row 3
row 4 compared to row 5
row 5 compared to row 1
row 5 compared to row 2
row 5 compared to row 3
row 5 compared to row 4
The number of comparisons grows quadratically with the size of the data frame (n rows means n(n-1) comparisons), so doubling the rows roughly quadruples the run time. With my rather large data frame, obviously it takes a while.
My proposed solution is this: after comparing a row with all of the other rows in the data frame, is there a way to omit it from future computations? For example, in the example above, row 1 compared to row 2 gives the same result as row 2 compared to row 1, so one of those calculations could be removed.
So, using the example data frame above, the only computations should be:
row 1 compared to row 2
row 1 compared to row 3
row 1 compared to row 4
row 1 compared to row 5
row 2 compared to row 3
row 2 compared to row 4
row 2 compared to row 5
row 3 compared to row 4
row 3 compared to row 5
row 4 compared to row 5
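In other words, only the n(n-1)/2 unordered pairs need computing. As a toy sketch of that idea (made-up surnames and a single column, not my real data; using combn() to enumerate each pair exactly once):

```r
library(stringdist)

# toy stand-in for one of the real columns
surnames <- c("Smith", "Smyth", "Jones", "Jone", "Brown")

# all unordered pairs of row indices: choose(5, 2) = 10 pairs, not 20
pairs <- combn(seq_along(surnames), 2)

# Levenshtein distance computed once per pair
d <- stringdist(surnames[pairs[1, ]], surnames[pairs[2, ]], method = "lv")

result <- data.frame(row_i = pairs[1, ], row_j = pairs[2, ], distance = d)
```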
This is the section in a function in the code that looks for these duplicates in various columns - any ideas on how I can amend this?
lastName <- stringdist(DataND$SURNAME[rownumber], DataND$SURNAME, method='lv')
firstName <- stringdist(DataND$GIVEN.NAME[rownumber], DataND$GIVEN.NAME, method='lv')
birthDate <- stringdist(DataND$DOB[rownumber], DataND$DOB, method='lv')
streetAddress <- stringdist(DataND$ADDR.1[rownumber], DataND$ADDR.1, method='lv')
suburb <- stringdist(DataND$LOCALITY[rownumber], DataND$LOCALITY, method='lv')
Upvotes: 3
Views: 926
Reputation: 6441
H 1's idea is great. Another option would be the fuzzyjoin package.
library(fuzzyjoin)
library(dplyr)
df <- tibble(id = seq(1,10),
words = replicate(10, paste(sample(LETTERS, 5), collapse = "")))
stringdist_left_join(df, df, by = c(words = "words"), max_dist = 5,
                     method = "lv", distance_col = "distance") %>%
  filter(distance != 0)
# A tibble: 90 x 5
id.x words.x id.y words.y distance
<int> <chr> <int> <chr> <dbl>
1 1 JUQYR 2 HQMFD 5
2 1 JUQYR 3 WHQOM 4
3 1 JUQYR 4 OUWJV 4
4 1 JUQYR 5 JURGD 3
5 1 JUQYR 6 ZMLAQ 5
6 1 JUQYR 7 RWLVU 5
7 1 JUQYR 8 AYNLE 5
8 1 JUQYR 9 AUPVJ 4
9 1 JUQYR 10 JDFEY 4
10 2 HQMFD 1 JUQYR 5
# ... with 80 more rows
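Since the self-join still returns both directions of every pair (1-2 and 2-1), one small variation (an untested sketch building on the same column names, not part of the original answer) is to filter on id.x < id.y, which keeps each unordered pair once and drops self-matches in one step:

```r
library(fuzzyjoin)
library(dplyr)

set.seed(1)  # assumed, just for reproducibility
df <- tibble(id = seq(1, 10),
             words = replicate(10, paste(sample(LETTERS, 5), collapse = "")))

matches <- stringdist_left_join(df, df, by = c(words = "words"), max_dist = 5,
                                method = "lv", distance_col = "distance") %>%
  filter(id.x < id.y)  # one row per unordered pair, no self-matches
```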
With this you have it all set up in the end; you can pick and dismiss rows by distance. It took 11 seconds for 100,000 records. Trying stringdistmatrix() instead, however, I got the error:
Error: cannot allocate vector of size 37.3 Gb
Upvotes: 2
Reputation: 2185
lastName <- stringdist(DataND$SURNAME[rownumber], DataND$SURNAME, method='lv')
If I understand this line, it compares one surname (selected by the value of rownumber) with all surnames. So each time you change rownumber, all the comparisons are made again, even the ones already done previously.
To prevent this, try:
lastName <- stringdist(DataND$SURNAME[rownumber], DataND$SURNAME[rownumber:nrows], method='lv')

where nrows is the number of rows.
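Putting this together, a minimal sketch of the resulting loop (toy data and an assumed match threshold of 1, not the asker's real columns; starting the slice at rownumber + 1 also skips comparing a row with itself):

```r
library(stringdist)

# toy stand-in for the real DataND
DataND <- data.frame(SURNAME = c("Smith", "Smyth", "Jones", "Jone", "Brown"),
                     stringsAsFactors = FALSE)
nrows <- nrow(DataND)

dupes <- list()
for (rownumber in seq_len(nrows - 1)) {
  # compare row `rownumber` only with the rows after it
  lastName <- stringdist(DataND$SURNAME[rownumber],
                         DataND$SURNAME[(rownumber + 1):nrows],
                         method = 'lv')
  hits <- which(lastName <= 1)  # assumed threshold for "potential duplicate"
  if (length(hits) > 0) {
    dupes[[length(dupes) + 1]] <- data.frame(row_i = rownumber,
                                             row_j = rownumber + hits,
                                             distance = lastName[hits])
  }
}
result <- do.call(rbind, dupes)
```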
Upvotes: 0