Reputation: 3375
I'm following various online tutorials for the Python dedupe library, but I keep hitting this error whichever one I try:
ValueError: Records do not line up with data model. The field 'firstname ' is in data_model but not in a record
Somebody on their GitHub had the same issue: https://github.com/dedupeio/csvdedupe/issues/55, and the developer said that the training examples have to contain whatever field is named in this error message.
My data has the firstname field in its records, as does the fields variable.
Data to be deduped:
{76550: {'id': '76550',
'title': 'mrs',
'firstname': 'mary',
'lastname': 'fakename',
'email': '[email protected]',
'phone': None,
'mobile': '353870748',
etc etc etc}
and here are the fields:
fields = [
{'field' : 'firstname ', 'type': 'String','has missing' : True},
{'field' : 'lastname ', 'type': 'String','has missing' : True},
{'field' : 'email', 'type': 'String','has missing' : True},
{'field' : 'address1', 'type': 'String', 'has missing' : True},
{'field' : 'mobile', 'type': 'String', 'has missing' : True},
]
The error is caused here:
# Pass in our model
deduper = dedupe.Dedupe(fields)
# Feed some sample data in ... 1500 records
deduper.sample(df, 1500)
ValueError Traceback (most recent call last)
<ipython-input-89-e34caa52a74c> in <module>
2
3 # Feed some sample data in ... 15000 records
----> 4 deduper.sample(df, 1500)
~\Anaconda3\envs\Tensorflow\lib\site-packages\dedupe\api.py in sample(self, data, sample_size, blocked_proportion, original_length)
789 a sample of full data
790 '''
--> 791 self._checkData(data)
792
793 self.active_learner = self.ActiveLearner(self.data_model,
~\Anaconda3\envs\Tensorflow\lib\site-packages\dedupe\api.py in _checkData(self, data)
802 'Dictionary of records is empty.')
803
--> 804 self.data_model.check(next(iter(viewvalues(data))))
805
806
~\Anaconda3\envs\Tensorflow\lib\site-packages\dedupe\datamodel.py in check(self, record)
119 raise ValueError("Records do not line up with data model. "
120 "The field '%s' is in data_model but not "
--> 121 "in a record" % field)
122
123
ValueError: Records do not line up with data model. The field 'firstname ' is in data_model but not in a record
Both have firstname in them.
Where am I going wrong?
I have tried transposing the dataframe and converting to dict in all sorts of ways. I can't get it to work.
Upvotes: 0
Views: 543
Reputation: 25
It's a file encoding issue. In my case, both files needed to be saved as UTF-8 with Unix (LF) line endings; otherwise Dedupe inserts extra spaces.
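If stray whitespace really does come from how the file was saved, one way to guard against it is to normalise the column names and values while loading, before anything is handed to dedupe. This is only a rough sketch using the standard library; read_records and the sample bytes are made up for illustration, and it assumes well-formed rows with an id column.

```python
import csv
import io

def read_records(raw_bytes, id_column='id'):
    """Hypothetical helper: parse CSV bytes into {id: record} with clean keys."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present
    text = raw_bytes.decode('utf-8-sig')
    # newline='' lets the csv module handle both LF and CRLF line endings
    reader = csv.DictReader(io.StringIO(text, newline=''))
    records = {}
    for row in reader:
        # strip whitespace from column names and values; empty values become None
        clean = {k.strip(): ((v or '').strip() or None) for k, v in row.items()}
        records[clean[id_column]] = clean
    return records

# Made-up sample: BOM, CRLF line endings and a padded 'firstname ' header
raw = b'\xef\xbb\xbfid,firstname ,lastname\r\n76550,mary,fakename\r\n'
records = read_records(raw)
print(records)
```

After this, the record keys are plain 'firstname' and 'lastname', so they can only match a field definition that is spelled without padding as well.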
Upvotes: 0
Reputation: 3249
The problem is that the field names in your field definition have a trailing space, so they don't match the keys in your records. The same applies to 'lastname '.
You want
'firstname'
not
'firstname '
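A quick way to catch this kind of mismatch before calling dedupe is to compare the field names against one record yourself. The sketch below does not use dedupe at all; strip_field_names and missing_fields are hypothetical helpers, and the record is a shortened version of the one in the question.

```python
def strip_field_names(fields):
    """Return copies of the field definitions with whitespace stripped from each name."""
    return [{**f, 'field': f['field'].strip()} for f in fields]

def missing_fields(fields, record):
    """List the field names from the definition that are not keys of the record."""
    return [f['field'] for f in fields if f['field'] not in record]

fields = [
    {'field': 'firstname ', 'type': 'String', 'has missing': True},
    {'field': 'lastname ', 'type': 'String', 'has missing': True},
    {'field': 'email', 'type': 'String', 'has missing': True},
]
# Shortened sample record, keyed the same way as the data in the question
record = {'id': '76550', 'firstname': 'mary', 'lastname': 'fakename', 'email': None}

print(missing_fields(fields, record))         # ['firstname ', 'lastname ']
clean_fields = strip_field_names(fields)
print(missing_fields(clean_fields, record))   # []
```

Passing the cleaned list (or simply retyping the names without the space) to dedupe.Dedupe should then get past this particular check.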
Upvotes: 1