Reputation: 452
I am trying to learn the dedupe library by running a very small example, but I am getting an error. Please help.
import dedupe
from Levenshtein import distance

# Define similarity functions - customize based on your matching criteria
def name_similarity(s1, s2):
    # Implement your name comparison logic here (e.g., Levenshtein distance, etc.)
    distance1 = distance(s1, s2)
    similarity = 1 - (distance1 / max(len(s1), len(s2)))  # Normalize distance to 0-1 similarity
    return similarity

if __name__ == '__main__':
    # Sample data (dictionary of record id -> record)
    data = {18709931: {'id': '18709931', 'name': 'TEST', 'ent_num': '8256364', 'ent_nm_txt': 'TST Corporation'},
            18484906: {'id': '18484906', 'name': 'VESTCOM', 'ent_num': '8256364', 'ent_nm_txt': 'TST Corporation'},
            18709961: {'id': '18709961', 'name': 'TESTMATERIALS', 'ent_num': '8256364', 'ent_nm_txt': 'TST Corporation'},
            19415694: {'id': '19415694', 'name': 'TEST', 'ent_num': '8256364', 'ent_nm_txt': 'TST Corporation'}}

    # Define a schema
    fields = [
        {'field': 'name', 'type': 'Custom', 'comparator': name_similarity},
        {'field': 'ent_num', 'type': 'Exact'},
    ]

    # Initialize a deduper
    deduper = dedupe.Dedupe(fields)

    # Active learning loop to label examples
    deduper.prepare_training(data)

    # Active learning loop
    dedupe.console_label(deduper)

    # Train the deduper
    deduper.train()

    # Save the trained model to disk
    with open('dedupe_model.pickle', 'wb') as f:
        dedupe.pickle.dump(deduper, f)
This is the error I get when running the training:
Traceback (most recent call last):
  File "C:\Python_Projects\Python_extra_code\test_dedupe_code.py", line 30, in <module>
    deduper.prepare_training(data)
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\api.py", line 1424, in prepare_training
    self.active_learner = labeler.DedupeDisagreementLearner(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\labeler.py", line 430, in __init__
    self.mark(examples, labels)
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\labeler.py", line 391, in mark
    learner.fit(self.pairs, self.y)
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\labeler.py", line 117, in fit
    self.current_predicates = self.block_learner.learn(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\training.py", line 58, in learn
    coverable_dupes = frozenset.union(*match_cover.values())
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unbound method frozenset.union() needs an argument
Process finished with exit code 1
Upvotes: 2
Views: 100
Reputation: 2853
The problem is that dedupe tries to use ent_num to create blocks, but this field holds the same value in every record (it is always 8256364).
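The last frame of the traceback can be reproduced without dedupe. Since frozenset.union is called with the unpacked values of match_cover and complains that it needs an argument, the dict must have been empty at that point. A minimal sketch in plain Python (the name match_cover is taken from the traceback; the key "some_predicate" is a hypothetical placeholder, not a real dedupe predicate):

```python
# Stand-in for the dict seen in dedupe/training.py; empty, as the error implies.
match_cover = {}

error_message = None
try:
    # Unpacking an empty dict's values passes no arguments at all.
    coverable_dupes = frozenset.union(*match_cover.values())
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# With at least one entry the same call succeeds.
match_cover = {"some_predicate": frozenset({1, 2})}  # hypothetical entry
coverable_dupes = frozenset.union(*match_cover.values())
print(coverable_dupes)
```

So the exception is a symptom: the blocking step found nothing usable, and the union over "nothing" is what actually raises.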
If you only want to find duplicates by name within a single value of ent_num (here 8256364), the best advice is to use name both for the Custom distance and for blocking, and basically exclude ent_num:
[
{'field': 'name', 'type': 'Custom', 'comparator': name_similarity},
{'field': 'name', 'type': 'String'}  # lets dedupe use name for blocking
]
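To see what the Custom comparator returns on the sample names, here is a small sketch that does not need dedupe or the Levenshtein package. The levenshtein helper is a plain-Python stand-in written for this illustration (an assumption, not the library's implementation), and the guard for two empty strings is an addition: the original name_similarity would divide by zero in that case.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def name_similarity(s1: str, s2: str) -> float:
    """Normalize edit distance to a 0-1 similarity, as in the question."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty names count as identical
    return 1 - levenshtein(s1, s2) / longest


names = ["TEST", "VESTCOM", "TESTMATERIALS", "TEST"]
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, round(name_similarity(first, second), 3))
```

Only the two TEST records score 1.0, so they are the only clear duplicate pair in this data set.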
UPD: updated based on comment from pbh
Upvotes: 0