Reputation: 5897
I have the following dataset in R:
address = c( "44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States",
"1799 Orlans Street, Maryland , USA")
name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )
blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
my_data = data.frame(address, name, blocking_var)
The data looks something like this:
> my_data
address name blocking_var
1 44 Ocean Road Atlanta Georgia Pancake House of America 1
2 882 4N Road River NY, NY 12345 ABC Center Building 1
3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
4 123 Fake Road Boston Drive Boston BD Home 25 New 1
5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
6 3665 Apt 5 Moon Crs Pine Recreational Center 2
7 3665 Unit Moon Crescent Pine Rec. cntR 2
8 NO ADDRESS PROVIDED Boston Swimming Complex 2
9 31 Silver Way Road boston gym center 2
10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital 3
11 1799 Orlans Street, Maryland , USA Massachusetts Hospital 3
I am trying to follow this R tutorial (https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf) and learn how to remove duplicates based on fuzzy conditions. The goal (within each "block") is to keep all unique records - and for fuzzy duplicates, only keep one occurrence of the duplicate.
I tried the following code:
library(RecordLinkage)
pairs=compare.dedup(my_data, blockfld=3)
But when I inspect the results, everything is NA
- given these results, I think I am doing something wrong and there does not seem to be any point in continuing until this error is resolved.
Can someone please show me how I can resolve this problem and continue on with the tutorial?
In the end, I am looking for something like this:
address name blocking_var
1 44 Ocean Road Atlanta Georgia Pancake House of America 1
2 882 4N Road River NY, NY 12345 ABC Center Building 1
4 123 Fake Road Boston Drive Boston BD Home 25 New 1
6 3665 Apt 5 Moon Crs Pine Recreational Center 2
9 31 Silver Way Road boston gym center 2
10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital 3
Thank you!
Upvotes: 0
Views: 981
Reputation: 621
You forgot to enable the string comparison on columns (strcmp
parameter):
address = c(
"44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States",
"1799 Orlans Street, Maryland , USA")
name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )
blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
my_data = data.frame(address, name, blocking_var)
library(RecordLinkage)
pairs <- compare.dedup(my_data, blockfld=3, strcmp = c("address", "name"))
pairs
#> $data
#> address name
#> 1 44 Ocean Road Atlanta Georgia Pancake House of America
#> 2 882 4N Road River NY, NY 12345 ABC Center Building
#> 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC
#> 4 123 Fake Road Boston Drive Boston BD Home 25 New
#> 5 123 Fake - Rd Boston 56789 Boarding Direct 25
#> 6 3665 Apt 5 Moon Crs Pine Recreational Center
#> 7 3665 Unit Moon Crescent Pine Rec. cntR
#> 8 NO ADDRESS PROVIDED Boston Swimming Complex
#> 9 31 Silver Way Road boston gym center
#> 10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital
#> 11 1799 Orlans Street, Maryland , USA Massachusetts Hospital
#> blocking_var
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 2
#> 7 2
#> 8 2
#> 9 2
#> 10 3
#> 11 3
#>
#> $pairs
#> id1 id2 address name blocking_var is_match
#> 1 1 2 0.4657088 0.5014620 1 NA
#> 2 1 3 0.4256705 0.4551587 1 NA
#> 3 1 4 0.5924184 0.4543651 1 NA
#> 4 1 5 0.5139994 0.4768519 1 NA
#> 5 2 3 0.9082051 0.5802005 1 NA
#> 6 2 4 0.5112554 0.4734336 1 NA
#> 7 2 5 0.5094017 0.5467836 1 NA
#> 8 3 4 0.4767677 0.4404762 1 NA
#> 9 3 5 0.5418803 0.4761905 1 NA
#> 10 4 5 0.8550583 0.6672619 1 NA
#> 11 6 7 0.8749962 0.8306277 1 NA
#> 12 6 8 0.4385965 0.5243193 1 NA
#> 13 6 9 0.5622807 0.5502822 1 NA
#> 14 7 8 0.3974066 0.5075914 1 NA
#> 15 7 9 0.5626812 0.5896359 1 NA
#> 16 8 9 0.3942495 0.6478338 1 NA
#> 17 10 11 0.6939076 0.6843434 1 NA
#>
#> $frequencies
#> address name blocking_var
#> 0.09090909 0.09090909 0.33333333
#>
#> $type
#> [1] "deduplication"
#>
#> attr(,"class")
#> [1] "RecLinkData"
It then goes like this, using e.g. the EpiLink algorithm:
# Compute EpiLink weights
pairs_w <- epiWeights(pairs)
# Explore the pairs and their weight to find a good cutoff
getPairs(pairs_w, min.weight=0.6, max.weight=0.8)
#> id address
#> 1 2 882 4N Road River NY, NY 12345
#> 2 3 882 - River Road NY, ZIP 12345
#> 3
#> 4 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 5 11 1799 Orlans Street, Maryland , USA
#> 6
#> 7 7 3665 Unit Moon Crescent
#> 8 9 31 Silver Way Road
#> 9
#> 10 6 3665 Apt 5 Moon Crs
#> 11 9 31 Silver Way Road
#> 12
#> 13 2 882 4N Road River NY, NY 12345
#> 14 5 123 Fake - Rd Boston 56789
#> 15
#> 16 1 44 Ocean Road Atlanta Georgia
#> 17 4 123 Fake Road Boston Drive Boston
#> 18
#> 19 8 NO ADDRESS PROVIDED
#> 20 9 31 Silver Way Road
#> 21
#> 22 3 882 - River Road NY, ZIP 12345
#> 23 5 123 Fake - Rd Boston 56789
#> 24
#> name blocking_var Weight
#> 1 ABC Center Building 1
#> 2 Cent. Bldg ABC 1 0.7916856
#> 3
#> 4 mas hospital 3
#> 5 Massachusetts Hospital 3 0.7468321
#> 6
#> 7 Pine Rec. cntR 2
#> 8 boston gym center 2 0.6548348
#> 9
#> 10 Pine Recreational Center 2
#> 11 boston gym center 2 0.6386475
#> 12
#> 13 ABC Center Building 1
#> 14 Boarding Direct 25 1 0.6156913
#> 15
#> 16 Pancake House of America 1
#> 17 BD Home 25 New 1 0.6118630
#> 18
#> 19 Boston Swimming Complex 2
#> 20 boston gym center 2 0.6099491
#> 21
#> 22 Cent. Bldg ABC 1
#> 23 Boarding Direct 25 1 0.6001716
#> 24
I chose > 0.7
to classify as link, < 0.6
to classify as a non-link.
Matches in-between are labelled as "possible".
pairs_class <- epiClassify(pairs_w, threshold.upper = 0.7, threshold.lower = 0.6)
summary(pairs_class)
#>
#> Deduplication Data Set
#>
#> 11 records
#> 17 record pairs
#>
#> 0 matches
#> 0 non-matches
#> 17 pairs with unknown status
#>
#>
#> Weight distribution:
#>
#> [0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8] (0.8,0.85]
#> 1 6 5 1 1 1 1
#> (0.85,0.9]
#> 1
#>
#> 4 links detected
#> 6 possible links detected
#> 7 non-links detected
#>
#> Classification table:
#>
#> classification
#> true status N P L
#> <NA> 7 6 4
And the results:
# detected links, possible matches, non-links
getPairs(pairs_class, show = "links")
#> id address
#> 1 6 3665 Apt 5 Moon Crs
#> 2 7 3665 Unit Moon Crescent
#> 3
#> 4 4 123 Fake Road Boston Drive Boston
#> 5 5 123 Fake - Rd Boston 56789
#> 6
#> 7 2 882 4N Road River NY, NY 12345
#> 8 3 882 - River Road NY, ZIP 12345
#> 9
#> 10 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 11 11 1799 Orlans Street, Maryland , USA
#> 12
#> name blocking_var Weight
#> 1 Pine Recreational Center 2
#> 2 Pine Rec. cntR 2 0.8801340
#> 3
#> 4 BD Home 25 New 1
#> 5 Boarding Direct 25 1 0.8054952
#> 6
#> 7 ABC Center Building 1
#> 8 Cent. Bldg ABC 1 0.7916856
#> 9
#> 10 mas hospital 3
#> 11 Massachusetts Hospital 3 0.7468321
#> 12
getPairs(pairs_class, show = "possible")
#> id address name blocking_var
#> 1 7 3665 Unit Moon Crescent Pine Rec. cntR 2
#> 2 9 31 Silver Way Road boston gym center 2
#> 3
#> 4 6 3665 Apt 5 Moon Crs Pine Recreational Center 2
#> 5 9 31 Silver Way Road boston gym center 2
#> 6
#> 7 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 8 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 9
#> 10 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 11 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 12
#> 13 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 14 9 31 Silver Way Road boston gym center 2
#> 15
#> 16 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 17 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 18
#> Weight
#> 1
#> 2 0.6548348
#> 3
#> 4
#> 5 0.6386475
#> 6
#> 7
#> 8 0.6156913
#> 9
#> 10
#> 11 0.6118630
#> 12
#> 13
#> 14 0.6099491
#> 15
#> 16
#> 17 0.6001716
#> 18
getPairs(pairs_class, show = "nonlinks")
#> id address name blocking_var
#> 1 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 2 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 3
#> 4 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 5 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 6
#> 7 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 8 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 9
#> 10 6 3665 Apt 5 Moon Crs Pine Recreational Center 2
#> 11 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 12
#> 13 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 14 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 15
#> 16 7 3665 Unit Moon Crescent Pine Rec. cntR 2
#> 17 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 18
#> 19 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 20 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 21
#> Weight
#> 1
#> 2 0.5890881
#> 3
#> 4
#> 5 0.5865789
#> 6
#> 7
#> 8 0.5794458
#> 9
#> 10
#> 11 0.5777132
#> 12
#> 13
#> 14 0.5591162
#> 15
#> 16
#> 17 0.5541298
#> 18
#> 19
#> 20 0.5442886
#> 21
Created on 2022-11-17 with reprex v2.0.2
Upvotes: 2