Brock Winfrey
Brock Winfrey

Reputation: 21

Use python/pandas to combine rows where duplicate values exist in one column

I want to use python to determine if the first instance of an ID value in the "Id" column has a match on a later row in that same column. If it does, then I want to take the value from the "Avail" column for the rows which match that initial "Id" value. Then I want to delete the rows with the duplicate Ids.

Here's my sample data: I have a CSV file that has data like this:

Id,First,Last,Avail  
abcdefg,John,Smith,4164667a-5dca-4ec6-a495-4be5b135d868=immediate  
dgasgas,Nancy,Adams,f98a8fbd-fb88-49b9-894e-631ba2a6f369=immediate  
gaytrjhu,John,Smith,e24ddf4c-c79f-4a84-a4ed-d92a10cc9e15=immediate  
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-5a10b9261b60=immediate  
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-c5dfe3b1276c=relative|7 

Desired output (v1) (Please note that I don't care about the "First" or "Last" columns from the duplicate rows. I only care about the "Avail" data from those:

Id,First,Last,Avail  
abcdefg,John,Smith,4164667a-5dca-4ec6-a495-4be5b135d868=immediate;3ec0c158-8782-41ff-8388-5a10b9261b60=immediate;3ec0c158-8782-41ff-8388-5a10b9261b60=immediate  
dgasgas,Nancy,Adams,f98a8fbd-fb88-49b9-894e-631ba2a6f369=immediate  
gaytrjhu,John,Smith,e24ddf4c-c79f-4a84-a4ed-d92a10cc9e15=immediate  
abcdefg,Nancy,Adams,3ec0c158-8782-41ff-8388-5a10b9261b60=immediate  
abcdefg,John,Smith,3ec0c158-8782-41ff-8388-c5dfe3b1276c=relative|7  

Then I'd like to delete the "duplicate" rows, leaving this:

Id,First,Last,Avail  
    abcdefg,John,Smith,4164667a-5dca-4ec6-a495-4be5b135d868=immediate;3ec0c158-8782-41ff-8388-5a10b9261b60=immediate;3ec0c158-8782-41ff-8388-5a10b9261b60=immediate  
    dgasgas,Nancy,Adams,f98a8fbd-fb88-49b9-894e-631ba2a6f369=immediate  
    gaytrjhu,John,Smith,e24ddf4c-c79f-4a84-a4ed-d92a10cc9e15=immediate 

Upvotes: 2

Views: 571

Answers (1)

scomes
scomes

Reputation: 1836

import pandas as pd

df = pd.DataFrame(data=[
        [1, 'John', 'Smith', 'a'],
        [1, 'John', 'Smith', 'b'],
        [2, 'Kate', 'Smith', 'c'],
    ],
    columns=['ID', 'First', 'Last', 'Avail']
)

output = (df
          .groupby(['ID', 'First', 'Last'], as_index=False)
          .agg({'Avail': lambda x: ';'.join(x)}))

You can use groupby as @Sphinx suggested. An example with the style of output you requested is above.

Upvotes: 1

Related Questions