Reputation: 111
I have a dataset where, on occasion, a row of data ends up being split across two rows of a DataFrame. I am able to isolate these rows and would like to combine each pair into a single row; unfortunately this is not always straightforward, as the data is assigned upstream to the DB on a first-in basis.
Example df:
import pandas as pd
import numpy as np

d = {'id': [1, 2, 2, 3, 3],
     'a1': [1, 1, np.nan, 1, np.nan],
     'b1': [1, 2, np.nan, 2, np.nan],
     'c1': [1, 3, np.nan, 3, 5],
     'c2': [1, 4, np.nan, 4, 6],
     'c3': [1, np.nan, 5, np.nan, 7],
     'c4': [1, np.nan, 6, np.nan, np.nan],
     'c5': [1, np.nan, 7, np.nan, np.nan],
     'd1': [1, np.nan, 8, np.nan, 8]}
df = pd.DataFrame(d)
   id   a1   b1   c1   c2   c3   c4   c5   d1
0   1  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
1   2  1.0  2.0  3.0  4.0  NaN  NaN  NaN  NaN
2   2  NaN  NaN  NaN  NaN  5.0  6.0  7.0  8.0
3   3  1.0  2.0  3.0  4.0  NaN  NaN  NaN  NaN
4   3  NaN  NaN  5.0  6.0  7.0  NaN  NaN  8.0
For clarity, duplicates are on 'id', and the intention is to have a single row per ID within df.
Scenario 1 (ID = 2): the index 2 values 5, 6, 7, 8 would replace the NaN values in index 1 directly above.
Scenario 2 (ID = 3): the index 4 values 5, 6, 7 would need to move to columns c3, c4, c5 within index 3, and value 8 (d1) would replace the NaN value in d1 at index 3.
This occurs multiple times within a large dataset, and the column at which the split happens varies, so I would ideally need to be able to do this dynamically wherever a duplicate ID is found.
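For reference, the affected rows can be picked out dynamically by looking for repeated IDs; a minimal sketch (keep=False is assumed here so that both halves of each split are returned):

# every row whose 'id' occurs more than once, i.e. both halves of a split
dupes = df[df['id'].duplicated(keep=False)]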
EDIT: for an additional scenario:
d = {'id': [1, 4, 4],
     'a1': [1, 1, np.nan],
     'b1': [2, 2, np.nan],
     'c1': [3, 3, 5],
     'c2': [4, 4, 6],
     'c3': [5, np.nan, np.nan],
     'c4': [6, np.nan, np.nan],
     'c5': [7, np.nan, np.nan],
     'd1': [8, np.nan, 8]}
df2 = pd.DataFrame(d)
   id   a1   b1  c1  c2   c3   c4   c5   d1
0   1  1.0  2.0   3   4  5.0  6.0  7.0  8.0
1   4  1.0  2.0   3   4  NaN  NaN  NaN  NaN
2   4  NaN  NaN   5   6  NaN  NaN  NaN  8.0
Scenario 3 (ID = 4): the same as Scenario 2, except that c5 (ID 4) would remain a NaN value in this case.
Upvotes: 1
Views: 278
Reputation: 260430
Assuming you always have the expected number of values in total per ID, you can flatten each group and drop the NaNs:
(df.set_index('id')
   .groupby(level=0)                              # one group per ID
   .apply(lambda d: d.stack().dropna()            # flatten the group, drop the NaNs
                     .set_axis(df.columns[1:]))   # relabel with the original columns
   .reset_index()
)
output:
   id   a1   b1   c1   c2   c3   c4   c5   d1
0   1  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
1   2  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
2   3  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
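If the key can span several columns, the same idea generalizes with a list of key columns: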
id_cols = ['id']
(df.set_index(id_cols)
   .groupby(level=list(range(len(id_cols))))      # group on all key levels
   .apply(lambda d: d.stack().dropna()
                     .set_axis(list(df.drop(columns=id_cols))))
   .reset_index()
)
In this alternative the columns are grouped per Xn, where X is a non-digit word and n the sublevel. Here I used a double groupby, once on the rows, once on the columns:
def flat(d):
    # group the columns by their non-digit prefix (a, b, c, d), then
    # left-pack the non-NaN values of each column group (the sort on
    # pd.isna is stable, so values keep their order and NaNs go last)
    return (d.groupby(d.columns.str.extract(r'(^\D+)', expand=False), axis=1)
             .apply(lambda g: pd.DataFrame([sorted(g.values.ravel(),
                                                   key=pd.isna)[:len(g.columns)]],
                                           columns=g.columns))
            ).iloc[0]

(df2.set_index('id')
    .groupby(level=0)   # one group per ID
    .apply(flat)
    .reset_index()
)
output:
0  id   a1   b1   c1   c2   c3   c4   c5   d1
0   1  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
1   4  1.0  2.0  3.0  4.0  5.0  6.0  NaN  8.0
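For what it's worth, flat is not specific to df2; applied to the first df it should reproduce the flattened result of the stack approach, since the non-NaN values of each column group are packed to the left before the NaNs:

(df.set_index('id')
   .groupby(level=0)
   .apply(flat)
   .reset_index()
)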
Upvotes: 2