SCool

Reputation: 3375

Python Dedupe Package Error: "Records do not line up with data model". But everything looks OK

I'm following various online tutorials for the Python dedupe package, but I keep hitting this error whichever one I try:

ValueError: Records do not line up with data model. The field 'firstname ' is in data_model but not in a record

Somebody reported the same issue on the project's GitHub: https://github.com/dedupeio/csvdedupe/issues/55. The developer said that the training examples must contain whichever field is named in the error message.

My data has a firstname key in each record, and so does the fields variable.

Data to be deduped:


{76550: {'id': '76550',
  'title': 'mrs',
  'firstname': 'mary',
  'lastname': 'fakename',
  'email': '[email protected]',
  'phone': None,
  'mobile': '353870748',
   etc etc etc}

and here are the fields:


fields = [
        {'field' : 'firstname ', 'type': 'String','has missing' : True},
        {'field' : 'lastname ', 'type': 'String','has missing' : True},
        {'field' : 'email', 'type': 'String','has missing' : True},
        {'field' : 'address1', 'type': 'String', 'has missing' : True},
        {'field' : 'mobile', 'type': 'String', 'has missing' : True},
        ]

The error is caused here:


# Pass in our model
deduper = dedupe.Dedupe(fields)

# Feed some sample data in ... 1500 records
deduper.sample(df, 1500)

ValueError                                Traceback (most recent call last)
<ipython-input-89-e34caa52a74c> in <module>
      2 
      3 # Feed some sample data in ... 15000 records
----> 4 deduper.sample(df, 1500)

~\Anaconda3\envs\Tensorflow\lib\site-packages\dedupe\api.py in sample(self, data, sample_size, blocked_proportion, original_length)
    789                                a sample of full data
    790         '''
--> 791         self._checkData(data)
    792 
    793         self.active_learner = self.ActiveLearner(self.data_model,

~\Anaconda3\envs\Tensorflow\lib\site-packages\dedupe\api.py in _checkData(self, data)
    802                 'Dictionary of records is empty.')
    803 
--> 804         self.data_model.check(next(iter(viewvalues(data))))
    805 
    806 

~\Anaconda3\envs\Tensorflow\lib\site-packages\dedupe\datamodel.py in check(self, record)
    119                 raise ValueError("Records do not line up with data model. "
    120                                  "The field '%s' is in data_model but not "
--> 121                                  "in a record" % field)
    122 
    123 

ValueError: Records do not line up with data model. The field 'firstname ' is in data_model but not in a record

Both have firstname in them.

Where am I going wrong?

I have tried transposing the dataframe and converting to dict in all sorts of ways. I can't get it to work.

Upvotes: 0

Views: 543

Answers (2)

Sarah Eaglesfield

Reputation: 25

It's the file encoding. In my case, both files needed to be saved as UTF-8 with Unix (LF) line endings; otherwise Dedupe inserts extra spaces.
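Whatever the source of the stray whitespace, one way to guard against it is to strip the header names while loading the data. The following is only a sketch under assumptions: the CSV content, the `read_records` helper, and the `id` key column are hypothetical and not part of dedupe itself.

```python
import csv
import io

# Hypothetical CSV content with stray spaces after two header names.
raw = "id,firstname ,lastname \r\n76550,mary,fakename\r\n"


def read_records(handle, key='id'):
    """Read CSV rows into {id: record}, stripping whitespace from header names."""
    reader = csv.DictReader(handle)
    records = {}
    for row in reader:
        # Normalise each column name so 'firstname ' becomes 'firstname'.
        clean = {name.strip(): value for name, value in row.items()}
        records[clean[key]] = clean
    return records


# newline='' leaves line endings untouched so the csv module can handle them.
records = read_records(io.StringIO(raw, newline=''))
print(records)
```

With a real file, opening it with `open(path, newline='', encoding='utf-8')` would play the role of the `io.StringIO` object here.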

Upvotes: 0

fgregg

Reputation: 3249

The problem is that the field name in your field definition has an extra trailing space.

You want

'firstname'

not

'firstname '
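A quick way to spot this kind of mismatch before calling dedupe is to compare the field names against one record's keys. This is a minimal sketch: the `missing_fields` and `strip_field_names` helpers are hypothetical, and the record values are placeholders shaped like the data in the question. Note that 'lastname ' in the question has the same trailing space.

```python
# Field definitions as posted in the question (note the trailing spaces).
fields = [
    {'field': 'firstname ', 'type': 'String', 'has missing': True},
    {'field': 'lastname ', 'type': 'String', 'has missing': True},
    {'field': 'email', 'type': 'String', 'has missing': True},
    {'field': 'address1', 'type': 'String', 'has missing': True},
    {'field': 'mobile', 'type': 'String', 'has missing': True},
]

# One record shaped like the question's data (values are placeholders).
record = {'id': '76550', 'firstname': 'mary', 'lastname': 'fakename',
          'email': None, 'address1': None, 'mobile': '353870748'}


def missing_fields(fields, record):
    """Return field names from the definition that are not keys of the record."""
    return [f['field'] for f in fields if f['field'] not in record]


def strip_field_names(fields):
    """Return a copy of the definitions with whitespace stripped from each name."""
    return [{**f, 'field': f['field'].strip()} for f in fields]


# Printing the list shows the repr of each name, which makes trailing spaces visible.
print(missing_fields(fields, record))

clean_fields = strip_field_names(fields)
print(missing_fields(clean_fields, record))
```

The cleaned list can then be passed to `dedupe.Dedupe(...)` in place of the original `fields`.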

Upvotes: 1
