Reputation: 154
I have the following structure of the df (as an example):
Edited data based on comment
import pandas as pd
data = {'ID':["abc", "abc", "123", "xyz", "xyz"], 'valid':[["A", "B"], [], [], ["A", "B", "C"], []],
'not_valid':[[], ["C", "D"], ["A", "B", "C", "D"], [], ["D"]],
'other_data': ["1", "1", "3", "4", "4"]}
df = pd.DataFrame(data)
Which gives:
ID valid not_valid other_data
abc [A, B] [] 1
abc [] [C, D] 1
123 [] [A, B, C, D] 3
xyz [A, B, C] [] 4
xyz [] [D] 4
What I want as a result is the following:
ID valid not_valid other_data
abc [A, B] [C, D] 1
123 [] [A, B, C, D] 3
xyz [A, B, C] [D] 4
I was wondering how I can merge the two columns valid
and not_valid
for each unique ID (not every ID necessarily has two entries, as 123
shows).
While I can remove duplicates with
df.drop_duplicates(subset=['ID']).reset_index()
I obviously lose half of the data for duplicate IDs.
Upvotes: 1
Views: 1256
Reputation: 863166
You can flatten the values by converting them to sets and then to lists in GroupBy.agg
:
cols = ['valid','not_valid']
df = df.groupby('ID', sort=False)[cols].agg(lambda x: list(set(z for y in x for z in y)))
print (df)
valid not_valid
ID
abc [A, B] [D, C]
123 [] [A, D, B, C]
xyz [A, B, C] [D]
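Note that going through set does not preserve the original element order (see [D, C] above). If order matters, one option (a sketch, not part of the original answer) is to deduplicate with dict.fromkeys, which keeps first-seen order:

```python
import pandas as pd

data = {'ID': ["abc", "abc", "123", "xyz", "xyz"],
        'valid': [["A", "B"], [], [], ["A", "B", "C"], []],
        'not_valid': [[], ["C", "D"], ["A", "B", "C", "D"], [], ["D"]]}
df = pd.DataFrame(data)

cols = ['valid', 'not_valid']
# flatten the lists per group; dict.fromkeys deduplicates while keeping order
f = lambda x: list(dict.fromkeys(z for y in x for z in y))
out = df.groupby('ID', sort=False)[cols].agg(f)
print(out)
```

This gives [C, D] and [A, B, C, D] in their original order instead of an arbitrary set order.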
EDIT: If all values of the other columns are the same within each ID group,
use:
#grouping by all columns without valid, not_valid
c = df.columns.difference(['valid','not_valid']).tolist()
f = lambda x: list(set(z for y in x for z in y))
df1 = df.groupby(c, sort=False, as_index=False).agg(f).reindex(df.columns, axis=1)
print (df1)
ID valid not_valid other_data
0 abc [A, B] [D, C] 1
1 123 [] [A, D, B, C] 3
2 xyz [A, B, C] [D] 4
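If the other columns are not guaranteed identical within a group, grouping by all of them would split the group. A sketch of an alternative (my assumption, not from the original answer): group by ID only and pass a per-column aggregation dict, flattening the list columns and taking the first value of everything else:

```python
import pandas as pd

data = {'ID': ["abc", "abc", "123", "xyz", "xyz"],
        'valid': [["A", "B"], [], [], ["A", "B", "C"], []],
        'not_valid': [[], ["C", "D"], ["A", "B", "C", "D"], [], ["D"]],
        'other_data': ["1", "1", "3", "4", "4"]}
df = pd.DataFrame(data)

f = lambda x: list(set(z for y in x for z in y))
# flatten the list columns, keep the first value of every other column
agg = {c: (f if c in ('valid', 'not_valid') else 'first')
       for c in df.columns.difference(['ID'])}
df1 = df.groupby('ID', sort=False, as_index=False).agg(agg)
print(df1)
```

This keeps one row per ID even when other_data varies inside a group, at the cost of silently discarding all but the first value.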
Upvotes: 3