SiDoesCoding

Reputation: 111

Combine pandas DataFrame rows with overlapping data

I have a dataset where, on occasion, a row of data ends up split across two rows of a DataFrame. I am able to isolate these rows and would like to combine each pair into a single row. Unfortunately this is not always straightforward, because the data is assigned upstream to the DB on a first-in basis.

Example df:

import pandas as pd
import numpy as np

d = {'id':[1,2,2,3,3],
     'a1':[1,1,np.nan,1,np.nan],
     'b1':[1,2,np.nan,2,np.nan],
     'c1':[1,3,np.nan,3,5],
     'c2':[1,4,np.nan,4,6],
     'c3':[1,np.nan,5,np.nan,7],
     'c4':[1,np.nan,6,np.nan,np.nan],
     'c5':[1,np.nan,7,np.nan,np.nan],
     'd1':[1,np.nan,8,np.nan,8]}

df = pd.DataFrame(d)

   id   a1   b1   c1   c2   c3   c4   c5   d1
0   1  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
1   2  1.0  2.0  3.0  4.0  NaN  NaN  NaN  NaN
2   2  NaN  NaN  NaN  NaN  5.0  6.0  7.0  8.0
3   3  1.0  2.0  3.0  4.0  NaN  NaN  NaN  NaN
4   3  NaN  NaN  5.0  6.0  7.0  NaN  NaN  8.0

For clarity, duplicates are identified on 'id', and the intention is to end up with a single row per id within df.

Scenario 1 (id = 2): the index 2 values 5, 6, 7, 8 would replace the NaN values in index 1 directly above.

Scenario 2 (id = 3): the index 4 values 5, 6, 7 would need to move to columns c3, c4, c5 within index 3, and the value 8 (d1) would replace the NaN value in d1 at index 3.

This occurs multiple times within a large dataset, and the column at which the split happens varies, so I would ideally need to handle it dynamically wherever a duplicate id is found.
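For reference, this is roughly how I isolate the affected rows (a minimal sketch, assuming the split rows can be identified purely by a repeated 'id'):

split_rows = df[df['id'].duplicated(keep=False)]  # every row whose id occurs more than once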

EDIT: an additional scenario.

d = {'id':[1,4,4],
     'a1':[1,1,np.nan],
     'b1':[2,2,np.nan],
     'c1':[3,3,5],
     'c2':[4,4,6],
     'c3':[5,np.nan,np.nan],
     'c4':[6,np.nan,np.nan],
     'c5':[7,np.nan,np.nan],
     'd1':[8,np.nan,8]}

df2 = pd.DataFrame(d)

   id   a1   b1  c1  c2   c3   c4   c5   d1
0   1  1.0  2.0   3   4  5.0  6.0  7.0  8.0
1   4  1.0  2.0   3   4  NaN  NaN  NaN  NaN
2   4  NaN  NaN   5   6  NaN  NaN  NaN  8.0

Scenario 3 (id = 4): the same as scenario 2, except that c5 for id 4 would remain NaN, because there are not enough values to fill every column.
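For completeness, the expected single-row result for id 4, inferred from the scenario above, would be:

   id   a1   b1   c1   c2   c3   c4   c5   d1
0   4  1.0  2.0  3.0  4.0  5.0  6.0  NaN  8.0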

Upvotes: 1

Views: 278

Answers (1)

mozway

Reputation: 260430

ideal number of values to fill the NaNs

Assuming you always have the ideal number of values, you can flatten per group without NaNs:

(df.set_index('id')
   .groupby(level=0)
   .apply(lambda d: d.stack().dropna().set_axis(df.columns[1:]))
   .reset_index()
)

output:

   id   a1   b1   c1   c2   c3   c4   c5   d1
0   1  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
1   2  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
2   3  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
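The idea: stack() flattens each group's rows into one long Series, dropna() removes the gaps left by the split, and set_axis() relabels the surviving values with the original column names. A minimal illustration for a single group (variable names are just for illustration):

g = df.set_index('id').loc[[2]]           # the two partial rows for id 2
flat_vals = g.stack().dropna()            # 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
row = flat_vals.set_axis(df.columns[1:])  # relabel as a1..d1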
Alternative for an arbitrary number of "id"/grouping columns:

id_cols = ['id']

(df.set_index(id_cols)
   .groupby(level=range(len(id_cols)))
   .apply(lambda d: d.stack().dropna().set_axis(list(df.drop(columns=id_cols))))
   .reset_index()
)

filling values per group of columns

In this alternative the columns are grouped per prefix Xn, where X is the non-digit part of the name and n is the numeric sublevel.

Here I used a double groupby, once on the rows, once on the columns:

def flat(d):
    # group the columns by their non-digit prefix ('a', 'b', 'c', 'd')
    return (d.groupby(d.columns.str.extract(r'(^\D+)', expand=False), axis=1)
             # within each column group, a stable sort pushes NaNs to the end;
             # keep as many values as the group has columns
             .apply(lambda g: pd.DataFrame([sorted(g.values.ravel(),
                                                   key=pd.isna)[:len(g.columns)]],
                                           columns=g.columns)
                    )).iloc[0]
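Two pieces do the work here: the regex pulls the non-digit prefix out of each column name so the columns can be grouped, and sorted(..., key=pd.isna) is a stable sort that pushes NaNs to the end of each group without reordering the real values. A quick look at what the column grouper produces (a sketch):

df2.drop(columns='id').columns.str.extract(r'(^\D+)', expand=False)
# Index(['a', 'b', 'c', 'c', 'c', 'c', 'c', 'd'], dtype='object')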

(df2.set_index('id')
   .groupby(level=0)
   .apply(flat)
   .reset_index()
)

output:

   id   a1   b1   c1   c2   c3   c4   c5   d1
0   1  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
1   4  1.0  2.0  3.0  4.0  5.0  6.0  NaN  8.0
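Note that groupby(..., axis=1) is deprecated in recent pandas versions (2.1+). A compatible variant of the same idea, grouping the columns manually instead (a sketch, not the original code above; flat_compat is a hypothetical name):

def flat_compat(d):
    # group columns by their non-digit prefix without groupby(axis=1)
    prefixes = d.columns.str.extract(r'(^\D+)', expand=False)
    row = {}
    for p in prefixes.unique():
        cols = d.columns[prefixes == p]
        # stable sort pushes NaNs to the end, then keep one value per column
        vals = sorted(d[cols].values.ravel(), key=pd.isna)[:len(cols)]
        row.update(dict(zip(cols, vals)))
    return pd.Series(row)

(df2.set_index('id')
    .groupby(level=0)
    .apply(flat_compat)
    .reset_index()
)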

Upvotes: 2
