Reputation: 78

Combining different parts of two rows in a dataframe using Pandas

I have a large (>10000 rows) dataframe like below, and I have to combine the rows with the same name by col_1, col_2, col_3 of group a and col_4, col_5 of group b or group c (b & c won't exist at the same time).

I could just find online how to combine (add up) every column of the rows, put not in different parts.

Original dataframe

name	group	col_1	col_2	col_3	col_4	col_5
ann	a	1	NaN	2	3	NaN
ann	b	1	NaN	NaN	2	3
ben	a	0	1	2	1	NaN
ben	c	0	NaN	NaN	3	NaN
cat	a	2	3	NaN	NaN	NaN

Desired result

name	col_1	col_2	col_3	col_4	col_5
ann	1	NaN	2	2	3
ben	0	1	2	3	NaN
cat	2	3	NaN	NaN	NaN

Upvotes: 2

Answers (3)

sophocles

Reputation: 13821

A bit of an ad-hoc approach, but one way you could do this is by sorting your dataframe by name and group, then split up your df into 2 parts (i.e. col1/2/3 for group a), perform a groupby and use first() on the first df and last() on the second, and then put back together using concat:

import pandas as pd

df.sort_values(['name','group'],ascending=True,inplace=True)

one = df.iloc[:,:5]
two = pd.concat([df.iloc[:,:2],df[['col_4','col_5']]],axis=1)

def my_func(df1,df2,col):
    return pd.concat([df1.groupby(col).first(),
                      df2.groupby(col).last()],
                      axis=1).drop('group',axis=1)
 
res = my_func(one,two,'name')

Which prints:

print(res)
      col_1  col_2  col_3  col_4  col_5
name                                   
ann       1    NaN   2.00   2.00   3.00
ben       0   1.00   2.00   3.00    NaN
cat       2   3.00    NaN    NaN    NaN

If i'm not mistaken this gets you what you need.

Upvotes: 5

rhug123

Reputation: 8768

Here is another way of doing it:

The first step is to sort the values in name and group to make sure they are ordered correctly, then set them as the index. Then we make a dictionary depending on the number at the end of the column. Then we use .agg in our groupby() and pass in our new dictionary.

g = ['name','group']
df = df.sort_values(by = g,ascending = (1,1)).set_index(g)
d = {i : 'first' if int(i[-1])<=3 else 'last' for i in df.columns}
new_df = df.groupby(level=0).agg(d)

Upvotes: 1

anky

Reputation: 75080

IIUC, you need to assign categorical value and then sort before grouping:

out = (df.assign(k=pd.Categorical(df['group'],['b','c','a'],ordered=True))
         .sort_values(['name','k'])
          .groupby("name",sort=False).first()).drop(["group","k"],1).reset_index()

print(out)

  name  col_1  col_2  col_3  col_4  col_5
0  ann      1    NaN    2.0    2.0    3.0
1  ben      0    1.0    2.0    3.0    NaN
2  cat      2    3.0    NaN    NaN    NaN

Upvotes: 3

name	group	col_1	col_2	col_3	col_4	col_5
ann	a	1	NaN	2	3	NaN
ann	b	1	NaN	NaN	2	3
ben	a	0	1	2	1	NaN
ben	c	0	NaN	NaN	3	NaN
cat	a	2	3	NaN	NaN	NaN

name	group	col_1	col_2	col_3	col_4	col_5
ann	a	1	NaN	2	3	NaN
ann	b	1	NaN	NaN	2	3
ben	a	0	1	2	1	NaN
ben	c	0	NaN	NaN	3	NaN
cat	a	2	3	NaN	NaN	NaN

Combining different parts of two rows in a dataframe using Pandas

Answers (3)

Related Questions

name	group	col_1	col_2	col_3	col_4	col_5
ann	a	1	NaN	2	3	NaN
ann	b	1	NaN	NaN	2	3
ben	a	0	1	2	1	NaN
ben	c	0	NaN	NaN	3	NaN
cat	a	2	3	NaN	NaN	NaN