stats_noob

Reputation: 5897

Performing Record Linkage in R

I have the following dataset in R:

address = c( "44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States", 
"1799 Orlans Street, Maryland , USA")
            
name = c("Pancake House of America", "ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital", "Massachusetts Hospital")

blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
            
my_data = data.frame(address, name, blocking_var)

The data looks something like this:

> my_data
                                               address                     name blocking_var
1                        44 Ocean Road Atlanta Georgia Pancake House of America            1
2                       882 4N Road River NY, NY 12345      ABC Center Building            1
3                       882 - River Road NY, ZIP 12345           Cent. Bldg ABC            1
4                    123 Fake Road Boston Drive Boston           BD Home 25 New            1
5                           123 Fake - Rd Boston 56789       Boarding Direct 25            1
6                                  3665 Apt 5 Moon Crs Pine Recreational Center            2
7                              3665 Unit Moon Crescent           Pine Rec. cntR            2
8                                  NO ADDRESS PROVIDED  Boston Swimming Complex            2
9                                   31 Silver Way Road        boston gym center            2
10 1800 Orleans St, Baltimore, MD 21287, United States             mas hospital            3
11                  1799 Orlans Street, Maryland , USA   Massachusetts Hospital            3


   

I am trying to follow this R tutorial (https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf) to learn how to remove duplicates based on fuzzy conditions. Within each "block", the goal is to keep every unique record and, for each group of fuzzy duplicates, keep only one occurrence.

I tried the following code:

library(RecordLinkage)
pairs=compare.dedup(my_data, blockfld=3)

But when I inspect the results, everything is NA. I think I am doing something wrong, and there seems to be no point in continuing until this is resolved.

Can someone please show me how I can resolve this problem and continue on with the tutorial?

In the end, I am looking for something like this:

                                               address                     name blocking_var
1                        44 Ocean Road Atlanta Georgia Pancake House of America            1
2                       882 4N Road River NY, NY 12345      ABC Center Building            1
4                    123 Fake Road Boston Drive Boston           BD Home 25 New            1
6                                  3665 Apt 5 Moon Crs Pine Recreational Center            2
9                                   31 Silver Way Road        boston gym center            2
10 1800 Orleans St, Baltimore, MD 21287, United States             mas hospital            3

Thank you!

Upvotes: 0

Views: 981

Answers (1)

Lorenzo G

Reputation: 621

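You forgot to enable string comparison on the text columns (the strcmp argument). Without it, compare.dedup only checks whether two values are exactly equal, which is of little use for fuzzy matching: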

address = c(
   "44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States", 
   "1799 Orlans Street, Maryland , USA")

name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )

blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)

my_data = data.frame(address, name, blocking_var)


library(RecordLinkage)

pairs <- compare.dedup(my_data, blockfld=3, strcmp = c("address", "name"))
pairs
#> $data
#>                                                address                     name
#> 1                        44 Ocean Road Atlanta Georgia Pancake House of America
#> 2                       882 4N Road River NY, NY 12345      ABC Center Building
#> 3                       882 - River Road NY, ZIP 12345           Cent. Bldg ABC
#> 4                    123 Fake Road Boston Drive Boston           BD Home 25 New
#> 5                           123 Fake - Rd Boston 56789       Boarding Direct 25
#> 6                                  3665 Apt 5 Moon Crs Pine Recreational Center
#> 7                              3665 Unit Moon Crescent           Pine Rec. cntR
#> 8                                  NO ADDRESS PROVIDED  Boston Swimming Complex
#> 9                                   31 Silver Way Road        boston gym center
#> 10 1800 Orleans St, Baltimore, MD 21287, United States             mas hospital
#> 11                  1799 Orlans Street, Maryland , USA   Massachusetts Hospital
#>    blocking_var
#> 1             1
#> 2             1
#> 3             1
#> 4             1
#> 5             1
#> 6             2
#> 7             2
#> 8             2
#> 9             2
#> 10            3
#> 11            3
#> 
#> $pairs
#>    id1 id2   address      name blocking_var is_match
#> 1    1   2 0.4657088 0.5014620            1       NA
#> 2    1   3 0.4256705 0.4551587            1       NA
#> 3    1   4 0.5924184 0.4543651            1       NA
#> 4    1   5 0.5139994 0.4768519            1       NA
#> 5    2   3 0.9082051 0.5802005            1       NA
#> 6    2   4 0.5112554 0.4734336            1       NA
#> 7    2   5 0.5094017 0.5467836            1       NA
#> 8    3   4 0.4767677 0.4404762            1       NA
#> 9    3   5 0.5418803 0.4761905            1       NA
#> 10   4   5 0.8550583 0.6672619            1       NA
#> 11   6   7 0.8749962 0.8306277            1       NA
#> 12   6   8 0.4385965 0.5243193            1       NA
#> 13   6   9 0.5622807 0.5502822            1       NA
#> 14   7   8 0.3974066 0.5075914            1       NA
#> 15   7   9 0.5626812 0.5896359            1       NA
#> 16   8   9 0.3942495 0.6478338            1       NA
#> 17  10  11 0.6939076 0.6843434            1       NA
#> 
#> $frequencies
#>      address         name blocking_var 
#>   0.09090909   0.09090909   0.33333333 
#> 
#> $type
#> [1] "deduplication"
#> 
#> attr(,"class")
#> [1] "RecLinkData"
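
To see what strcmp changes, here is a minimal sketch that compares one address pair by hand. It assumes the RecordLinkage package is installed, and that compare.dedup uses the package's jarowinkler function when strcmp is enabled without a custom strcmpfun (my reading of the documentation, not something shown in the output above).

```r
library(RecordLinkage)

a <- "882 4N Road River NY, NY 12345"
b <- "882 - River Road NY, ZIP 12345"

# Exact comparison: the two spellings differ, so this is FALSE.
exact <- a == b

# Jaro-Winkler similarity: a score between 0 (nothing in common)
# and 1 (identical strings).
fuzzy <- jarowinkler(a, b)

print(exact)
print(fuzzy)
```

The address and name columns of $pairs hold scores of this kind, while blocking_var is always 1 because only records from the same block are paired.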

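The is_match column is still NA, which is expected: it holds the known match status, and none was supplied (the identity argument). From here the tutorial continues with weight computation, for example using the EpiLink algorithm: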


# Compute EpiLink weights
pairs_w <- epiWeights(pairs)

# Explore the pairs and their weight to find a good cutoff

getPairs(pairs_w, min.weight=0.6, max.weight=0.8)
#>    id                                             address
#> 1   2                      882 4N Road River NY, NY 12345
#> 2   3                      882 - River Road NY, ZIP 12345
#> 3                                                        
#> 4  10 1800 Orleans St, Baltimore, MD 21287, United States
#> 5  11                  1799 Orlans Street, Maryland , USA
#> 6                                                        
#> 7   7                             3665 Unit Moon Crescent
#> 8   9                                  31 Silver Way Road
#> 9                                                        
#> 10  6                                 3665 Apt 5 Moon Crs
#> 11  9                                  31 Silver Way Road
#> 12                                                       
#> 13  2                      882 4N Road River NY, NY 12345
#> 14  5                          123 Fake - Rd Boston 56789
#> 15                                                       
#> 16  1                       44 Ocean Road Atlanta Georgia
#> 17  4                   123 Fake Road Boston Drive Boston
#> 18                                                       
#> 19  8                                 NO ADDRESS PROVIDED
#> 20  9                                  31 Silver Way Road
#> 21                                                       
#> 22  3                      882 - River Road NY, ZIP 12345
#> 23  5                          123 Fake - Rd Boston 56789
#> 24                                                       
#>                        name blocking_var    Weight
#> 1       ABC Center Building            1          
#> 2            Cent. Bldg ABC            1 0.7916856
#> 3                                                 
#> 4              mas hospital            3          
#> 5    Massachusetts Hospital            3 0.7468321
#> 6                                                 
#> 7            Pine Rec. cntR            2          
#> 8         boston gym center            2 0.6548348
#> 9                                                 
#> 10 Pine Recreational Center            2          
#> 11        boston gym center            2 0.6386475
#> 12                                                
#> 13      ABC Center Building            1          
#> 14       Boarding Direct 25            1 0.6156913
#> 15                                                
#> 16 Pancake House of America            1          
#> 17           BD Home 25 New            1 0.6118630
#> 18                                                
#> 19  Boston Swimming Complex            2          
#> 20        boston gym center            2 0.6099491
#> 21                                                
#> 22           Cent. Bldg ABC            1          
#> 23       Boarding Direct 25            1 0.6001716
#> 24
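
As a rough check on where these weights come from, the sketch below recomputes one of them in base R. It assumes the EpiLink weight is a weighted mean of the per-column similarities, with column weights log2((1 - e) / f), where f is the value in $frequencies and e = 0.01 is what I understand to be the default error rate of epiWeights. The helper name epilink_weight is hypothetical.

```r
# Hypothetical helper: EpiLink-style weight for one record pair.
# sims  - per-column similarities in [0, 1]
# freqs - per-column average value frequencies (see $frequencies)
# e     - assumed error rate, taken here to be the same for every column
epilink_weight <- function(sims, freqs, e = 0.01) {
  w <- log2((1 - e) / freqs)   # rarer values get a larger column weight
  sum(w * sims) / sum(w)       # weighted mean of the similarities
}

# Values copied from the $frequencies and $pairs output (records 2 and 3).
freqs    <- c(address = 0.09090909, name = 0.09090909, blocking_var = 0.33333333)
sims_2_3 <- c(address = 0.9082051,  name = 0.5802005,  blocking_var = 1)

w_2_3 <- epilink_weight(sims_2_3, freqs)
print(w_2_3)  # about 0.7917, in line with the Weight printed for this pair
```

Because blocking_var is 1 for every pair, it raises all weights by the same mechanism, which is one reason the scores cluster above 0.5.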

I chose an upper threshold of 0.7 and a lower threshold of 0.6: pairs with a weight above 0.7 are classified as links, pairs below 0.6 as non-links, and pairs in between are labelled "possible".


pairs_class <- epiClassify(pairs_w, threshold.upper = 0.7, threshold.lower = 0.6)
summary(pairs_class)
#> 
#> Deduplication Data Set
#> 
#> 11 records 
#> 17 record pairs 
#> 
#> 0 matches
#> 0 non-matches
#> 17 pairs with unknown status
#> 
#> 
#> Weight distribution:
#> 
#> [0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8] (0.8,0.85] 
#>          1          6          5          1          1          1          1 
#> (0.85,0.9] 
#>          1 
#> 
#> 4 links detected 
#> 6 possible links detected 
#> 7 non-links detected 
#> 
#> Classification table:
#> 
#>            classification
#> true status N P L
#>        <NA> 7 6 4
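
The effect of the two thresholds can be reproduced in base R. This is only an illustration, not how epiClassify is implemented: the weights are copied by hand from the getPairs() listings in this answer, and treating a weight exactly on a threshold as belonging to the higher class is my assumption.

```r
# Weights copied from the getPairs() listings (links, possible, non-links).
weights <- c(
  0.8801340, 0.8054952, 0.7916856, 0.7468321,
  0.6548348, 0.6386475, 0.6156913, 0.6118630, 0.6099491, 0.6001716,
  0.5890881, 0.5865789, 0.5794458, 0.5777132, 0.5591162, 0.5541298,
  0.5442886
)

upper <- 0.7
lower <- 0.6

# L = link, P = possible link, N = non-link
label <- ifelse(weights >= upper, "L",
                ifelse(weights >= lower, "P", "N"))

counts <- table(factor(label, levels = c("N", "P", "L")))
print(counts)  # N = 7, P = 6, L = 4, as in the classification table
```

Moving either threshold and re-running the table is a quick way to see how many pairs would change class before calling epiClassify again.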

And the results:


# detected links, possible matches, non-links
getPairs(pairs_class, show = "links")
#>    id                                             address
#> 1   6                                 3665 Apt 5 Moon Crs
#> 2   7                             3665 Unit Moon Crescent
#> 3                                                        
#> 4   4                   123 Fake Road Boston Drive Boston
#> 5   5                          123 Fake - Rd Boston 56789
#> 6                                                        
#> 7   2                      882 4N Road River NY, NY 12345
#> 8   3                      882 - River Road NY, ZIP 12345
#> 9                                                        
#> 10 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 11 11                  1799 Orlans Street, Maryland , USA
#> 12                                                       
#>                        name blocking_var    Weight
#> 1  Pine Recreational Center            2          
#> 2            Pine Rec. cntR            2 0.8801340
#> 3                                                 
#> 4            BD Home 25 New            1          
#> 5        Boarding Direct 25            1 0.8054952
#> 6                                                 
#> 7       ABC Center Building            1          
#> 8            Cent. Bldg ABC            1 0.7916856
#> 9                                                 
#> 10             mas hospital            3          
#> 11   Massachusetts Hospital            3 0.7468321
#> 12

getPairs(pairs_class, show = "possible")
#>    id                           address                     name blocking_var
#> 1   7           3665 Unit Moon Crescent           Pine Rec. cntR            2
#> 2   9                31 Silver Way Road        boston gym center            2
#> 3                                                                            
#> 4   6               3665 Apt 5 Moon Crs Pine Recreational Center            2
#> 5   9                31 Silver Way Road        boston gym center            2
#> 6                                                                            
#> 7   2    882 4N Road River NY, NY 12345      ABC Center Building            1
#> 8   5        123 Fake - Rd Boston 56789       Boarding Direct 25            1
#> 9                                                                            
#> 10  1     44 Ocean Road Atlanta Georgia Pancake House of America            1
#> 11  4 123 Fake Road Boston Drive Boston           BD Home 25 New            1
#> 12                                                                           
#> 13  8               NO ADDRESS PROVIDED  Boston Swimming Complex            2
#> 14  9                31 Silver Way Road        boston gym center            2
#> 15                                                                           
#> 16  3    882 - River Road NY, ZIP 12345           Cent. Bldg ABC            1
#> 17  5        123 Fake - Rd Boston 56789       Boarding Direct 25            1
#> 18                                                                           
#>       Weight
#> 1           
#> 2  0.6548348
#> 3           
#> 4           
#> 5  0.6386475
#> 6           
#> 7           
#> 8  0.6156913
#> 9           
#> 10          
#> 11 0.6118630
#> 12          
#> 13          
#> 14 0.6099491
#> 15          
#> 16          
#> 17 0.6001716
#> 18

getPairs(pairs_class, show = "nonlinks")
#>    id                           address                     name blocking_var
#> 1   1     44 Ocean Road Atlanta Georgia Pancake House of America            1
#> 2   5        123 Fake - Rd Boston 56789       Boarding Direct 25            1
#> 3                                                                            
#> 4   2    882 4N Road River NY, NY 12345      ABC Center Building            1
#> 5   4 123 Fake Road Boston Drive Boston           BD Home 25 New            1
#> 6                                                                            
#> 7   1     44 Ocean Road Atlanta Georgia Pancake House of America            1
#> 8   2    882 4N Road River NY, NY 12345      ABC Center Building            1
#> 9                                                                            
#> 10  6               3665 Apt 5 Moon Crs Pine Recreational Center            2
#> 11  8               NO ADDRESS PROVIDED  Boston Swimming Complex            2
#> 12                                                                           
#> 13  3    882 - River Road NY, ZIP 12345           Cent. Bldg ABC            1
#> 14  4 123 Fake Road Boston Drive Boston           BD Home 25 New            1
#> 15                                                                           
#> 16  7           3665 Unit Moon Crescent           Pine Rec. cntR            2
#> 17  8               NO ADDRESS PROVIDED  Boston Swimming Complex            2
#> 18                                                                           
#> 19  1     44 Ocean Road Atlanta Georgia Pancake House of America            1
#> 20  3    882 - River Road NY, ZIP 12345           Cent. Bldg ABC            1
#> 21                                                                           
#>       Weight
#> 1           
#> 2  0.5890881
#> 3           
#> 4           
#> 5  0.5865789
#> 6           
#> 7           
#> 8  0.5794458
#> 9           
#> 10          
#> 11 0.5777132
#> 12          
#> 13          
#> 14 0.5591162
#> 15          
#> 16          
#> 17 0.5541298
#> 18          
#> 19          
#> 20 0.5442886
#> 21

Created on 2022-11-17 with reprex v2.0.2
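
To get from the classified pairs to a deduplicated table like the one asked for, one simple option is to drop the second record of every detected link. The sketch below is self-contained base R: drop_linked_duplicates is a hypothetical helper, the link ids are copied by hand from the "links" listing, and records is a shortened stand-in for my_data.

```r
# Hypothetical helper: remove every record that appears as id2 in a link,
# so the first record of each linked pair is the one that is kept.
# data  - the original data frame
# links - data frame with columns id1 and id2 holding row indices of data
drop_linked_duplicates <- function(data, links) {
  keep <- !(seq_len(nrow(data)) %in% links$id2)
  data[keep, , drop = FALSE]
}

# Link pairs copied from the getPairs(pairs_class, show = "links") output.
links <- data.frame(id1 = c(6, 4, 2, 10), id2 = c(7, 5, 3, 11))

# Stand-in for my_data (address column left out to keep this short).
records <- data.frame(
  name = c("Pancake House of America", "ABC Center Building", "Cent. Bldg ABC",
           "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center",
           "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center",
           "mas hospital", "Massachusetts Hospital"),
  blocking_var = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3)
)

deduped <- drop_linked_duplicates(records, links)
print(rownames(deduped))  # "1" "2" "4" "6" "8" "9" "10"
```

With the real objects, the link table should be obtainable as pairs_class$pairs[pairs_class$prediction == "L", c("id1", "id2")], assuming the classification result stores its labels in a prediction component with levels N, P and L. Row 8 stays in the result because none of its pairs reached the 0.7 threshold; dropping it as in the desired output would require reviewing the "possible" links by hand.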

Upvotes: 2
