pbh

Reputation: 452

Dedupe Library issue with csv file

I am trying to learn the dedupe library by running a very small example, but I am getting an error. Please help.

import dedupe
from Levenshtein import distance
# Define similarity functions - customize based on your matching criteria
def name_similarity(s1, s2):
    # Normalized Levenshtein similarity in [0, 1]; 1.0 means identical strings
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings: avoid dividing by zero
    return 1 - distance(s1, s2) / longest


if __name__ == '__main__':
    # Sample data (dict of record dicts, keyed by record id)
    data = {18709931: {'id': '18709931', 'name': 'TEST', 'ent_num': '8256364', 'ent_nm_txt': 'TST Corporation'},
            18484906: {'id': '18484906', 'name': 'VESTCOM', 'ent_num': '8256364', 'ent_nm_txt': 'TST Corporation'},
            18709961: {'id': '18709961', 'name': 'TESTMATERIALS', 'ent_num': '8256364', 'ent_nm_txt': 'TST Corporation'},
            19415694: {'id': '19415694', 'name': 'TEST', 'ent_num': '8256364', 'ent_nm_txt': 'TST Corporation'}}




    # Define a schema
    fields = [
        {'field': 'name', 'type': 'Custom', 'comparator': name_similarity},
        {'field': 'ent_num', 'type': 'Exact'},

    ]

    # Initialize a deduper
    deduper = dedupe.Dedupe(fields)

    # Sample candidate pairs from the data for active learning
    deduper.prepare_training(data)

    # Active learning loop
    dedupe.console_label(deduper)

    # Train the deduper
    deduper.train()

    # Save the learned settings to disk
    with open('dedupe_model.pickle', 'wb') as f:
        deduper.write_settings(f)

This is the error I get when running the training step:

Traceback (most recent call last):
  File "C:\Python_Projects\Python_extra_code\test_dedupe_code.py", line 30, in <module>
    deduper.prepare_training(data)
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\api.py", line 1424, in prepare_training
    self.active_learner = labeler.DedupeDisagreementLearner(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\labeler.py", line 430, in __init__
    self.mark(examples, labels)
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\labeler.py", line 391, in mark
    learner.fit(self.pairs, self.y)
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\labeler.py", line 117, in fit
    self.current_predicates = self.block_learner.learn(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\training.py", line 58, in learn
    coverable_dupes = frozenset.union(*match_cover.values())
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unbound method frozenset.union() needs an argument

Process finished with exit code 1

Upvotes: 2

Views: 100

Answers (1)

Johnny Cheesecutter

Reputation: 2853

The problem is that dedupe tries to use ent_num to create blocks, but that field holds a single value across all of your records (it is always 8256364), so it cannot separate them into groups.
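As for the message itself: the last frame of the traceback calls frozenset.union(*match_cover.values()). If match_cover is empty, the unpacking passes no arguments at all, and Python raises exactly this TypeError. Here is a minimal sketch that does not use dedupe; the empty match_cover dict and the predicate names are stand-ins I made up to mimic that state, not values taken from the library.

```python
# Stand-in for the state at the failing line: no predicate covered any pair.
# (Assumption for illustration; the real dict is built inside dedupe.)
match_cover = {}

try:
    # Unpacking an empty dict's values calls frozenset.union() with no arguments.
    frozenset.union(*match_cover.values())
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# With at least one entry, the same call works and merges the covered pairs.
match_cover = {
    "predicate_a": frozenset({1, 2}),  # hypothetical predicate names
    "predicate_b": frozenset({2, 3}),
}
coverable = frozenset.union(*match_cover.values())
print(sorted(coverable))
```

So the TypeError is a symptom: by the time that line runs, there is nothing left to union.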

If you only want to find duplicate names within the single ent_num=8256364 group, the best advice is to use name both for the Custom distance and for blocking, and to drop ent_num from the field definitions:

[
    {'field': 'name', 'type': 'Custom', 'comparator': name_similarity},
    {'field': 'name', 'type': 'String'}  # lets dedupe use name for blocking
]
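To illustrate why a constant field is useless for blocking, here is a rough sketch that groups the question's records by the exact value of one field. The block_by helper is hypothetical and much simpler than the predicates dedupe actually learns; it only shows how the two fields split the data.

```python
from collections import defaultdict

# Records from the question, trimmed to the two fields being compared.
records = {
    18709931: {'name': 'TEST', 'ent_num': '8256364'},
    18484906: {'name': 'VESTCOM', 'ent_num': '8256364'},
    18709961: {'name': 'TESTMATERIALS', 'ent_num': '8256364'},
    19415694: {'name': 'TEST', 'ent_num': '8256364'},
}


def block_by(records, field):
    """Group record ids by the exact value of one field (hypothetical helper)."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[record[field]].append(record_id)
    return dict(blocks)


ent_num_blocks = block_by(records, 'ent_num')  # one block holding every record
name_blocks = block_by(records, 'name')        # separate blocks; the two TEST rows share one

print(len(ent_num_blocks), len(name_blocks))
```

Blocking on ent_num puts all four records into a single block, which narrows nothing down, while blocking on name already puts the two TEST rows together and keeps the others apart.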

UPD: updated based on comment from pbh

Upvotes: 0
