Reputation: 154
I have the following structure of the df (as an example):
Edited data based on comment
import pandas as pd
data = {'ID':["abc", "abc", "123", "xyz", "xyz"], 'valid':[["A", "B"], [], [], ["A", "B", "C"], []],
'not_valid':[[], ["C", "D"], ["A", "B", "C", "D"], [], ["D"]],
'other_data': ["1", "1", "3", "4", "4"]}
df = pd.DataFrame(data)
Which gives:
ID valid not_valid other_data
abc [A, B] [] 1
abc [] [C, D] 1
123 [] [A, B, C, D] 3
xyz [A, B, C] [] 4
xyz [] [D] 4
What I want as a result is the following:
ID valid not_valid other_data
abc [A, B] [C, D] 1
123 [] [A, B, C, D] 3
xyz [A, B, C] [D] 4
I was wondering how I can merge the two columns valid
and not_valid
for each unique ID (not every ID necessarily has two entries, as 123
shows).
While I can remove duplicates with
df.drop_duplicates(subset=['ID']).reset_index()
I obviously lose half of the data for duplicate IDs.
Upvotes: 1
Views: 1256
Reputation: 863166
You can flatten the values by converting them to sets and then to lists in GroupBy.agg
:
cols = ['valid','not_valid']
df = df.groupby('ID', sort=False)[cols].agg(lambda x: list(set(z for y in x for z in y)))
print (df)
valid not_valid
ID
abc [A, B] [D, C]
123 [] [A, D, B, C]
xyz [A, B, C] [D]
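Note that going through set does not preserve the original element order (see [D, C] above). If order matters, one option (a sketch, not part of the original answer) is to deduplicate with dict.fromkeys, which keeps first-seen order:

```python
import pandas as pd

data = {'ID': ["abc", "abc", "123", "xyz", "xyz"],
        'valid': [["A", "B"], [], [], ["A", "B", "C"], []],
        'not_valid': [[], ["C", "D"], ["A", "B", "C", "D"], [], ["D"]]}
df = pd.DataFrame(data)

cols = ['valid', 'not_valid']
# flatten the lists per group; dict.fromkeys deduplicates while keeping order
f = lambda x: list(dict.fromkeys(z for y in x for z in y))
out = df.groupby('ID', sort=False)[cols].agg(f)
print(out)
```

This gives [C, D] and [A, B, C, D] in their original order instead of an arbitrary set order.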
EDIT: If all values of the other columns are the same within each ID group,
use:
#grouping by all columns without valid, not_valid
c = df.columns.difference(['valid','not_valid']).tolist()
f = lambda x: list(set(z for y in x for z in y))
df1 = df.groupby(c, sort=False, as_index=False).agg(f).reindex(df.columns, axis=1)
print (df1)
ID valid not_valid other_data
0 abc [A, B] [D, C] 1
1 123 [] [A, D, B, C] 3
2 xyz [A, B, C] [D] 4
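If the other columns are not guaranteed identical within a group, grouping by all of them would split the group. A sketch of an alternative (my assumption, not from the original answer): group by ID only and pass a per-column aggregation dict, flattening the list columns and taking the first value of everything else:

```python
import pandas as pd

data = {'ID': ["abc", "abc", "123", "xyz", "xyz"],
        'valid': [["A", "B"], [], [], ["A", "B", "C"], []],
        'not_valid': [[], ["C", "D"], ["A", "B", "C", "D"], [], ["D"]],
        'other_data': ["1", "1", "3", "4", "4"]}
df = pd.DataFrame(data)

f = lambda x: list(set(z for y in x for z in y))
# flatten the list columns, keep the first value of every other column
agg = {c: (f if c in ('valid', 'not_valid') else 'first')
       for c in df.columns.difference(['ID'])}
df1 = df.groupby('ID', sort=False, as_index=False).agg(agg)
print(df1)
```

This keeps one row per ID even when other_data varies inside a group, at the cost of silently discarding all but the first value.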
Upvotes: 3