lsignori
lsignori

Reputation: 15

Combine Rows to Have Same ID Based on Various Criteria

enter image description hereI have over 900,000 records of contacts and need to find a way to merge the records that have the same 'First Name' AND 'Last Name' AND either the same 'Street Address', 'Email', or 'Phone'. I want to keep the record with the most recent 'Last Modified Date' as the main record, and all the duplicates update their 'Contact ID' to the one of the main record. I also want to keep the original contact ID to track in a new column! Of course, I'd like to ignore Nulls when merging. My code is pretty long, so here is a link to download it (I don't want to take up too much space here): https://drive.google.com/file/d/1tJRr4IyHwHf2NF80j1LO7bTDakXRUtVM/view?usp=sharing

Upvotes: 0

Views: 328

Answers (2)

MDR
MDR

Reputation: 2670

Having rejected the first answer I posted and then updating the question with more requirements, please do note: This site is not a free code writing service. And the link to your code doesn't work (at least at the moment).

Given:

import pandas as pd
import numpy as np

df = pd.DataFrame({'First_Name': {0: 'Greg',  1: 'Greg',  2: 'John',  3: 'John',  4: 'Ryan',  5: 'Ryan'}, \
                    'Last_Name': {0: 'Li', 1: 'Li', 2: 'Doe', 3: 'Doe', 4: 'Lin', 5: 'Lin'}, \
                    'ContactID': {0: 123, 1: 1877, 2: 566, 3: 234, 4: 789, 5: 52}, \
                    'Last_Modified_Date': {0: '2021-04-08',  1: '2019-05-06',  2: '2018-02-03', \
                                           3: '2014-05-07',  4: '2019-06-07',  5: '2018-06-07'}, \
                    'Email': {0: '[email protected]',  1: '[email protected]',  2: '[email protected]',  \
                              3: '[email protected]',  4: '[email protected]',  5: np.nan}, \
                    'Address': {0: '44 Sherman',  1: np.nan,  2: '87 Branch Ave',  3: '87 Branch Ave', \
                                4: '84 Newport',  5: np.nan}, 'Phone': {0: '999-999-9999',  1: np.nan, \
                                                                     2: '890-523-4667',  3: np.nan,  4: \
                                                                     '678-900-000',  5: '678-900-000'}})

print(df)

enter image description here

Try:

df['Last_Modified_Date'] = pd.to_datetime(df['Last_Modified_Date'], format='%Y-%m-%d')
df = df.sort_values(by='Last_Modified_Date')
df['AllContactID'] = df['ContactID'].map(str)
df = df.replace(np.nan, '', regex=False)
df = df.groupby(by=['First_Name', 'Last_Name'], as_index=False)\
    .agg({'Last_Modified_Date': 'last', 'ContactID' : 'last', \
          'Email' : ', '.join, 'Address' : ', '.join, 'Phone' : ', '.join, 'AllContactID' : ', '.join})  
df = df.replace(r'(.*?)(,\s)\1', r', \1', regex=True)
df = df.replace(r'^, (.*)$', r'\1', regex=True)
df = df.replace(r', $', r'', regex=True)
#df['AllContactID'] = df.AllContactID.apply(lambda x: list(x.split(', ')))
#df['AllContactID'] = df.AllContactID.apply(lambda x: list(map(int, x)))

print(df)

enter image description here

Upvotes: 0

Vipul Dhariwal
Vipul Dhariwal

Reputation: 21

The idea is to sort the dataframe by modified date and then iterate through the rows saving the combination as a string in a dictionary as a key.

dataframe.sort(by = ['Last Modified Date'], ascending = False)
dataframe['old contact id'] = ""
dict1 = {}
for i in range(len(dataframe)):
    if dict1.get(str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['email'][i])):
        df.loc[df['contact id'] == dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['email'][i])]]['old contact id'] = df['contact id'][i]
    elif dict1.get(str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['address'][i])):
        df.loc[df['contact id'] == dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['address'][i])]]['old contact id'] = df['contact id'][i]
    elif dict1.get(str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['phone'][i])):
        df.loc[df['contact id'] == dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['phone'][i])]]['old contact id'] = df['contact id'][i]
    else:
        dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['email'][i])] = = str(dataframe['Contact id'][i]
        dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['address'][i])] = = str(dataframe['Contact id'][i]
        dict1[str(dataframe['First Name'][i] + dataframe['Last Name'][i] + dataframe['phone'][i])] = str(dataframe['Contact id'][i]

At last if the old modified is empty, delete the column.

dataframe = dataframe.loc[dataframe['old contact id'] != ""]

The code is long but after sorting it gives O(n) complexity.

Upvotes: 1

Related Questions