Reputation: 101
I have a dataframe that outputs multiple rows for some IDs based on how much info I have on each, and I want to flatten it where possible. An example of my dataframe is below.
id address city state country company title email
001 80 Fake St. San F. CA USA null null [email protected]
001 null null null null Amazon CEO null
002 2 Test Ave. San F. CA USA null null [email protected]
002 null null null null Google CEO [email protected]
I would like for the rows for 001 to be combined into one, since all values can fit into the same row. For 002, it is ok for them to be on separate rows, since there needs to be two rows anyways to store both emails.
I've tried
df.groupby('id', squeeze=True)
which gets me close to what I want, but for 002 it duplicates all data from the first row and puts it on the second. I would prefer that those stay as null, so that the same data isn't being processed twice once I receive the final output. So my desired final output is:
id address city state country company title email
001 80 Fake St. San F. CA USA Amazon CEO [email protected]
002 2 Test Ave. San F. CA USA null null [email protected]
002 null null null null Google CEO [email protected]
I never know which columns will be able to be merged/flattened, so I can't drop duplicates based on certain columns (I don't think). Any suggestions on how to achieve this output would be great!
Upvotes: 2
Views: 143
Reputation: 323306
Here I am using sorted
, since it is more easy to read
yourdf=df.groupby('id').apply(lambda x : x.apply(lambda y: sorted(y,key=pd.isnull))).dropna(thresh=2)
yourdf
id address city state country company title email
0 1 80FakeSt. SanF. CA USA Amazon CEO [email protected]
2 2 2TestAve. SaF. CA USA Google CEO [email protected]
3 2 NaN NaN NaN NaN NaN NaN [email protected]
If you need the speed , check justify
Upvotes: 1