Reputation: 5555
Let's say that I have the following data-frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"unique_id": [1, 1, 1], "att1_amr": [11, 11, 11], "att2_nominal": [1, np.nan, np.nan], "att3_nominal": [np.nan, 1, np.nan], "att4_bok": [33.33, 33.33, 33.33], "att5_nominal": [np.nan, np.nan, np.nan], "att6_zpq": [22.22, 22.22, 22.22]})
What I want to do is group the rows of the data-frame by unique_id, applying one aggregation to the columns whose names contain the word nominal and a different one to all the others. Specifically, I want to aggregate the nominal columns with sum(min_count=1) and the remaining columns with first() or last(). The result should be the following:
df_result = pd.DataFrame({"unique_id": [1], "att1_amr": [11], "att2_nominal": [1], "att3_nominal": [1], "att4_bok": [33.33], "att5_nominal": [np.nan], "att6_zpq": [22.22]})
Thank you!
Upvotes: 1
Views: 285
Reputation: 5555
The solution provided by @jezrael works just fine and is the most elegant one; however, I ran into severe performance issues with it. Surprisingly, I found the following to be a much faster solution that achieves the same goal.
# Split the columns: "nominal" ones get a NaN-preserving sum, the rest first()
nominal_cols = df.filter(like="nominal").columns.values
other_cols = [col for col in df.columns.values if col not in nominal_cols and col != "unique_id"]

df1 = df.groupby('unique_id', as_index=False)[nominal_cols].sum(min_count=1)
df2 = df.groupby('unique_id', as_index=False)[other_cols].first()
df_result = pd.merge(df1, df2, on=["unique_id"], how="inner")
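Assuming pandas and numpy are imported as pd/np, the approach can be run end-to-end on the sample frame from the question as a quick sanity check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unique_id": [1, 1, 1],
    "att1_amr": [11, 11, 11],
    "att2_nominal": [1, np.nan, np.nan],
    "att3_nominal": [np.nan, 1, np.nan],
    "att4_bok": [33.33, 33.33, 33.33],
    "att5_nominal": [np.nan, np.nan, np.nan],
    "att6_zpq": [22.22, 22.22, 22.22],
})

# Columns whose name contains "nominal" are summed; min_count=1 keeps an
# all-NaN group as NaN instead of collapsing it to 0. All other columns
# simply keep their first value per group.
nominal_cols = list(df.filter(like="nominal").columns)
other_cols = [c for c in df.columns if c not in nominal_cols and c != "unique_id"]

df1 = df.groupby("unique_id", as_index=False)[nominal_cols].sum(min_count=1)
df2 = df.groupby("unique_id", as_index=False)[other_cols].first()
result = pd.merge(df1, df2, on="unique_id", how="inner")
print(result)
```

One row per unique_id comes back, with att5_nominal still NaN because every value in that group was NaN.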
Upvotes: 0
Reputation: 862471
You can create the aggregation dictionary dynamically: first map all columns containing nominal
to a lambda function, then map all remaining columns to 'last'
, merge the two dictionaries together, and finally call DataFrameGroupBy.agg
:
d1 = dict.fromkeys(df.columns[df.columns.str.contains('nominal')],
lambda x : x.sum(min_count=1))
d2 = dict.fromkeys(df.columns.difference(['unique_id'] + list(d1)), 'last')
d = {**d1, **d2}
df = df.groupby('unique_id').agg(d)
print (df)
           att2_nominal  att3_nominal  att5_nominal  att1_amr  att4_bok  \
unique_id
1                   1.0           1.0           NaN        11     33.33

           att6_zpq
unique_id
1             22.22
Another, cleaner solution:
d = {k: (lambda x : x.sum(min_count=1))
if 'nominal' in k
else 'last'
for k in df.columns.difference(['unique_id'])}
df = df.groupby('unique_id').agg(d)
print (df)
           att1_amr  att2_nominal  att3_nominal  att4_bok  att5_nominal  \
unique_id
1                11           1.0           1.0     33.33           NaN

           att6_zpq
unique_id
1             22.22
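For completeness, here is a self-contained run of the dictionary-comprehension variant on the sample frame (assuming the usual pd/np imports):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unique_id": [1, 1, 1],
    "att1_amr": [11, 11, 11],
    "att2_nominal": [1, np.nan, np.nan],
    "att3_nominal": [np.nan, 1, np.nan],
    "att4_bok": [33.33, 33.33, 33.33],
    "att5_nominal": [np.nan, np.nan, np.nan],
    "att6_zpq": [22.22, 22.22, 22.22],
})

# "nominal" columns get a NaN-preserving sum; every other column keeps
# its last value within the group.
d = {k: (lambda x: x.sum(min_count=1)) if "nominal" in k else "last"
     for k in df.columns.difference(["unique_id"])}
out = df.groupby("unique_id").agg(d)
print(out)
```

Note that df.columns.difference sorts the column names, which is why the printed output above shows them in alphabetical order.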
Upvotes: 2
Reputation: 71560
Why not just:
>>> df.ffill().bfill().drop_duplicates()
   att1_amr  att2_nominal  att3_nominal  att4_bok  att5_nominal  att6_zpq  \
0        11           1.0           1.0     33.33           NaN     22.22

   unique_id
0          1
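As a runnable sketch on the sample frame (note the assumption here: ffill/bfill ignore group boundaries, so this only matches the groupby results when values do not conflict across different unique_id groups):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unique_id": [1, 1, 1],
    "att1_amr": [11, 11, 11],
    "att2_nominal": [1, np.nan, np.nan],
    "att3_nominal": [np.nan, 1, np.nan],
    "att4_bok": [33.33, 33.33, 33.33],
    "att5_nominal": [np.nan, np.nan, np.nan],
    "att6_zpq": [22.22, 22.22, 22.22],
})

# Forward-fill then back-fill propagates the lone non-NaN value through
# each column; the rows become identical and drop_duplicates keeps one.
out = df.ffill().bfill().drop_duplicates()
print(out)
```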
Upvotes: 0