blackcat
blackcat

Reputation: 78

Combining different parts of two rows in a dataframe using Pandas

I have a large (>10000 rows) dataframe like below, and I have to combine the rows with the same name by col_1, col_2, col_3 of group a and col_4, col_5 of group b or group c (b & c won't exist at the same time).

I could just find online how to combine (add up) every column of the rows, put not in different parts.

Original dataframe

name group col_1 col_2 col_3 col_4 col_5
ann a 1 NaN 2 3 NaN
ann b 1 NaN NaN 2 3
ben a 0 1 2 1 NaN
ben c 0 NaN NaN 3 NaN
cat a 2 3 NaN NaN NaN

Desired result

name col_1 col_2 col_3 col_4 col_5
ann 1 NaN 2 2 3
ben 0 1 2 3 NaN
cat 2 3 NaN NaN NaN

Upvotes: 2

Views: 668

Answers (3)

sophocles
sophocles

Reputation: 13821

A bit of an ad-hoc approach, but one way you could do this is by sorting your dataframe by name and group, then split up your df into 2 parts (i.e. col1/2/3 for group a), perform a groupby and use first() on the first df and last() on the second, and then put back together using concat:

import pandas as pd

df.sort_values(['name','group'],ascending=True,inplace=True)

one = df.iloc[:,:5]
two = pd.concat([df.iloc[:,:2],df[['col_4','col_5']]],axis=1)

def my_func(df1,df2,col):
    return pd.concat([df1.groupby(col).first(),
                      df2.groupby(col).last()],
                      axis=1).drop('group',axis=1)
 
res = my_func(one,two,'name')

Which prints:

print(res)
      col_1  col_2  col_3  col_4  col_5
name                                   
ann       1    NaN   2.00   2.00   3.00
ben       0   1.00   2.00   3.00    NaN
cat       2   3.00    NaN    NaN    NaN

If i'm not mistaken this gets you what you need.

Upvotes: 5

rhug123
rhug123

Reputation: 8768

Here is another way of doing it:

The first step is to sort the values in name and group to make sure they are ordered correctly, then set them as the index. Then we make a dictionary depending on the number at the end of the column. Then we use .agg in our groupby() and pass in our new dictionary.

g = ['name','group']
df = df.sort_values(by = g,ascending = (1,1)).set_index(g)
d = {i : 'first' if int(i[-1])<=3 else 'last' for i in df.columns}
new_df = df.groupby(level=0).agg(d)

Upvotes: 1

anky
anky

Reputation: 75080

IIUC, you need to assign categorical value and then sort before grouping:

out = (df.assign(k=pd.Categorical(df['group'],['b','c','a'],ordered=True))
         .sort_values(['name','k'])
          .groupby("name",sort=False).first()).drop(["group","k"],1).reset_index()

print(out)

  name  col_1  col_2  col_3  col_4  col_5
0  ann      1    NaN    2.0    2.0    3.0
1  ben      0    1.0    2.0    3.0    NaN
2  cat      2    3.0    NaN    NaN    NaN

Upvotes: 3

Related Questions