Reputation: 78
I have a large (>10,000 rows) dataframe like the one below, and I need to combine rows that share the same name: col_1, col_2, col_3 should come from the group a row, and col_4, col_5 from the group b or group c row (b and c never exist at the same time for one name).
I could only find examples online that combine (add up) every column of the matching rows, not ones that take different columns from different rows.
Original dataframe
name | group | col_1 | col_2 | col_3 | col_4 | col_5 |
---|---|---|---|---|---|---|
ann | a | 1 | NaN | 2 | 3 | NaN |
ann | b | 1 | NaN | NaN | 2 | 3 |
ben | a | 0 | 1 | 2 | 1 | NaN |
ben | c | 0 | NaN | NaN | 3 | NaN |
cat | a | 2 | 3 | NaN | NaN | NaN |
Desired result
name | col_1 | col_2 | col_3 | col_4 | col_5 |
---|---|---|---|---|---|
ann | 1 | NaN | 2 | 2 | 3 |
ben | 0 | 1 | 2 | 3 | NaN |
cat | 2 | 3 | NaN | NaN | NaN |
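For reference, a minimal snippet to rebuild the example data (assuming the empty cells are NaN floats):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name':  ['ann', 'ann', 'ben', 'ben', 'cat'],
    'group': ['a', 'b', 'a', 'c', 'a'],
    'col_1': [1, 1, 0, 0, 2],
    'col_2': [np.nan, np.nan, 1, np.nan, 3],
    'col_3': [2, np.nan, 2, np.nan, np.nan],
    'col_4': [3, 2, 1, 3, np.nan],
    'col_5': [np.nan, 3, np.nan, np.nan, np.nan],
})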
Upvotes: 2
Views: 668
Reputation: 13821
A bit of an ad-hoc approach, but one way you could do this is to sort your dataframe by name and group, split it into two parts (col_1/2/3 for group a, and col_4/5 for groups b/c), perform a groupby on each, use first() on the first part and last() on the second, and then put them back together with concat:
import pandas as pd

# df is the original dataframe from the question
df.sort_values(['name', 'group'], ascending=True, inplace=True)

# first block: name, group and the group-a columns col_1..col_3
one = df.iloc[:, :5]
# second block: name, group and the group-b/c columns col_4, col_5
two = pd.concat([df.iloc[:, :2], df[['col_4', 'col_5']]], axis=1)

def my_func(df1, df2, col):
    # first() for the group-a block, last() for the b/c block,
    # then drop the leftover 'group' columns
    return pd.concat([df1.groupby(col).first(),
                      df2.groupby(col).last()],
                     axis=1).drop('group', axis=1)

res = my_func(one, two, 'name')
Which prints:
print(res)
col_1 col_2 col_3 col_4 col_5
name
ann 1 NaN 2.00 2.00 3.00
ben 0 1.00 2.00 3.00 NaN
cat 2 3.00 NaN NaN NaN
If I'm not mistaken, this gets you what you need.
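One thing worth keeping in mind (my reading of the pandas behaviour, not something stated in the answer): groupby().first() and groupby().last() return the first/last non-null value per column within each group, which is what lets a NaN in one row fall through to the value from the other row. A tiny illustration:

import pandas as pd
import numpy as np

s = pd.DataFrame({'name': ['ann', 'ann'], 'v': [np.nan, 2.0]})
print(s.groupby('name')['v'].first())  # 2.0 -- the leading NaN is skipped
print(s.groupby('name')['v'].last())   # 2.0 -- last non-null value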
Upvotes: 5
Reputation: 8768
Here is another way of doing it:
The first step is to sort the values by name and group to make sure they are ordered correctly, then set those two columns as the index. Next we build a dictionary that maps each column to 'first' or 'last' depending on the number at the end of the column name. Finally we call .agg on the groupby() and pass in that dictionary.
g = ['name', 'group']
df = df.sort_values(by=g, ascending=[True, True]).set_index(g)

# col_1..col_3 get 'first', col_4/col_5 get 'last', based on the trailing digit
d = {i: 'first' if int(i[-1]) <= 3 else 'last' for i in df.columns}

new_df = df.groupby(level=0).agg(d)
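If it helps to see what the dictionary ends up as with the question's column names:

print(d)
# {'col_1': 'first', 'col_2': 'first', 'col_3': 'first',
#  'col_4': 'last', 'col_5': 'last'}

# new_df is indexed by name; reset_index() gives a flat column again if needed
print(new_df.reset_index())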
Upvotes: 1
Reputation: 75080
IIUC, you need to assign a categorical value and then sort before grouping:
out = (df.assign(k=pd.Categorical(df['group'], ['b', 'c', 'a'], ordered=True))
         .sort_values(['name', 'k'])
         .groupby('name', sort=False).first()
         .drop(columns=['group', 'k'])
         .reset_index())
print(out)
print(out)
name col_1 col_2 col_3 col_4 col_5
0 ann 1 NaN 2.0 2.0 3.0
1 ben 0 1.0 2.0 3.0 NaN
2 cat 2 3.0 NaN NaN NaN
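In case the categorical trick is unfamiliar, a small illustration (not part of the answer): passing ['b', 'c', 'a'] as ordered categories makes the b/c row sort before the a row within each name, so first() favours its values and falls back to the a row wherever the b/c row is NaN.

import pandas as pd

k = pd.Categorical(['a', 'b', 'c'], categories=['b', 'c', 'a'], ordered=True)
print(k.sort_values())  # ['b', 'c', 'a'] -- custom order, not alphabetical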
Upvotes: 3