Reputation: 1
I have two datasets, each containing Name, First Name, Street, House Number, Postal Code, and City. I have noticed that these datasets contain many duplicates. For instance, in one dataset the first name is "John" while in the other it is "Jon", with the same last name, street, house number, and city, and a postal code that differs in one digit. Since I have millions of records, there are many ways in which the same person could appear with slight differences across the two datasets.
I believe I need to do five things:

1. Identify the pairs of records that probably refer to the same person.
2. Classify each pair into a case according to which fields differ. Examples of cases: Name different; Name and Postal Code different; First Name different; First Name and Name different; City different; and so on.
3. Compute a distance measure for each pair, i.e. the number of single-character edits needed to make the two records identical. For instance, from "Jon" to "John" I need to add 1 letter, and from postal code 34567 to 34568 I need to change 1 digit, resulting in a distance of 2 for this pair. (Steps 2 to 5 are sketched in code after the tables below.)
4. Create a frequency table of the cases, for example:

| Case | Frequency (%) |
| --- | --- |
| Name different | 50 |
| Name and Postal Code different | 30 |
5. Create a frequency table of the cases broken down by the distance measure, for example:

| Case | Distance Measure | Frequency (%) |
| --- | --- | --- |
| Name different | 1 | 50 |
| Name different | 2 | 30 |
| Name different | 3 | 20 |
| Name and Postal Code different | 3 | 100 |
or, alternatively, as a cross-table of case against distance measure (rows sum to 100%):

| Case / Distance Measure | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- |
| Name different | 20% | 10% | 10% | 5% | 5% | 50% |
| Name and Postal Code different | 10% | 30% | 10% | 20% | 5% | 25% |
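From my research so far, I think steps 2 to 5 could be handled with pandas plus an edit-distance function such as the Levenshtein distance from the rapidfuzz package. Below is a minimal sketch with made-up example pairs; the column names and the `classify` helper are only placeholders I invented, not part of any library:

```python
import pandas as pd
from rapidfuzz.distance import Levenshtein  # pip install rapidfuzz

# Hypothetical matched pairs; in reality these would come out of step 1.
pairs = pd.DataFrame({
    "first_name_a": ["Jon",   "Anna",  "Peter"],
    "first_name_b": ["John",  "Anna",  "Peter"],
    "postal_a":     ["34567", "12345", "99999"],
    "postal_b":     ["34568", "12345", "99990"],
})

# Step 3: distance = number of single-character edits, summed over the
# compared fields (Levenshtein distance per field).
pairs["distance"] = pairs.apply(
    lambda r: Levenshtein.distance(r["first_name_a"], r["first_name_b"])
    + Levenshtein.distance(r["postal_a"], r["postal_b"]),
    axis=1,
)

# Step 2: label each pair with its case, i.e. which fields differ.
def classify(row):
    diffs = []
    if row["first_name_a"] != row["first_name_b"]:
        diffs.append("First Name")
    if row["postal_a"] != row["postal_b"]:
        diffs.append("Postal Code")
    return " and ".join(diffs) + " different" if diffs else "Identical"

pairs["case"] = pairs.apply(classify, axis=1)

# Step 4: frequency of each case in percent.
print(pairs["case"].value_counts(normalize=True) * 100)

# Step 5: cross-table of case against distance, rows in percent.
print(pd.crosstab(pairs["case"], pairs["distance"], normalize="index") * 100)
```

If I understand correctly, `value_counts(normalize=True)` would give the per-case percentages for step 4, and `pd.crosstab(..., normalize="index")` the case-by-distance cross-table for step 5.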
Could you please help me identify which libraries I would need to perform these steps with Python and Jupyter?
For now I do not have access to the data, so I am only doing research.
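For step 1, the Python Record Linkage Toolkit (recordlinkage) keeps coming up in my reading. Since I cannot test anything without the data, the sketch below is only my current understanding of how the candidate-pair step might look; the placeholder data, the blocking key, and the thresholds are assumptions on my part:

```python
import pandas as pd
import recordlinkage  # pip install recordlinkage

# Placeholder frames; the real ones would be the two source datasets.
df_a = pd.DataFrame({"Name": ["Smith"], "First Name": ["Jon"],
                     "Postal Code": ["34567"], "City": ["Examplecity"]})
df_b = pd.DataFrame({"Name": ["Smith"], "First Name": ["John"],
                     "Postal Code": ["34568"], "City": ["Examplecity"]})

# Candidate generation: comparing millions of rows against millions of rows
# is not feasible, so only records sharing a blocking key are compared.
# Blocking on City (not Postal Code) so the 34567/34568 example is not lost;
# which key to block on is something I still need to figure out.
indexer = recordlinkage.Index()
indexer.block("City")
candidate_pairs = indexer.index(df_a, df_b)

# Field-by-field comparison of each candidate pair (1 = fields match).
compare = recordlinkage.Compare()
compare.exact("Name", "Name", label="name")
compare.string("First Name", "First Name", method="levenshtein",
               threshold=0.7, label="first_name")
compare.string("Postal Code", "Postal Code", method="levenshtein",
               threshold=0.7, label="postal_code")
features = compare.compute(candidate_pairs, df_a, df_b)

# Treat pairs that agree on at least two of the three fields as duplicates.
matches = features[features.sum(axis=1) >= 2]
print(matches)
```

Is this roughly the right set of tools (recordlinkage, rapidfuzz, pandas), or are there better-suited libraries for this kind of deduplication?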
Upvotes: 0
Views: 45