Reputation: 2342
I have a DataFrame in pandas like the one below:
import pandas as pd
data = [
(1, None, None, None, '2023-01-10', None, None),
(1, None, None, None, '2023-01-10', 1, 0),
(1, 9, 0, 0.55, '2023-01-10', 15, None),
(2, None, None, None, '2023-11-22', 2, 1),
(2, 88, 1, 0.68, '2023-11-22', 103, 8)
]
df = pd.DataFrame(data, columns=['id', 'col1', 'col2', 'col3', 'col_date', 'col4', 'col5'])
df
For each group defined by the columns id and col_date (datetime dtype), I need to fill the values in columns col1, col2 and col3. In every such group, at least one row has values in col1, col2 and col3, and I need to fill the remaining rows of the group with those values.
The values in col4 and col5 (and in the many other columns that are not included in this example) have to stay untouched.
As a result, I need something like this:
data = [
(1, 9, 0, 0.55, '2023-01-10', None, None),
(1, 9, 0, 0.55, '2023-01-10', 1, 0),
(1, 9, 0, 0.55, '2023-01-10', 15, None),
(2, 88, 1, 0.68, '2023-11-22', 2, 1),
(2, 88, 1, 0.68, '2023-11-22', 103, 8)
]
df = pd.DataFrame(data, columns=['id', 'col1', 'col2', 'col3', 'col_date', 'col4', 'col5'])
df
How can I do that in pandas?
Upvotes: 0
Views: 60
Reputation: 295
Try this. The snippet below merges the DataFrame onto itself after keeping only the rows without NaN in the columns to fill. Note that df2 will contain only those rows where all three columns (col1, col2 and col3) are non-null.
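# Keep the key columns and the columns to fill
df2 = df[['id', 'col_date', 'col1', 'col2', 'col3']].copy(deep=True)
# Keep only rows where col1, col2 and col3 are all non-null
df2 = df2[df2[['col1', 'col2', 'col3']].notnull().all(axis=1)]
# Attach the complete values to every row of the same group (they get the _y suffix)
final = df.merge(df2, on=['id', 'col_date'], how='inner')
final = final[['id', 'col1_y', 'col2_y', 'col3_y', 'col_date', 'col4', 'col5']].rename(columns={'col1_y': 'col1', 'col2_y': 'col2', 'col3_y': 'col3'})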
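A self-contained variant of the same merge idea, as a sketch on the question's sample data. The names keys, fill_cols, lookup and merged_fill are illustrative and not part of the original snippet. It assumes at most one distinct set of values per group; drop_duplicates keeps the first complete row if there are several, and how='left' keeps groups that have no complete row at all (their values stay NaN).

```python
import pandas as pd

data = [
    (1, None, None, None, '2023-01-10', None, None),
    (1, None, None, None, '2023-01-10', 1, 0),
    (1, 9, 0, 0.55, '2023-01-10', 15, None),
    (2, None, None, None, '2023-11-22', 2, 1),
    (2, 88, 1, 0.68, '2023-11-22', 103, 8),
]
cols = ['id', 'col1', 'col2', 'col3', 'col_date', 'col4', 'col5']
df = pd.DataFrame(data, columns=cols)

keys = ['id', 'col_date']             # grouping columns
fill_cols = ['col1', 'col2', 'col3']  # columns to fill within each group

# One complete row per group; drop_duplicates guards against several complete rows.
lookup = (
    df.loc[df[fill_cols].notna().all(axis=1), keys + fill_cols]
      .drop_duplicates(subset=keys)
)

# Drop the sparse columns, re-attach the complete values, restore the column order.
merged_fill = df.drop(columns=fill_cols).merge(lookup, on=keys, how='left')[cols]
print(merged_fill)
```

Because the sparse columns are dropped before the merge, no _x/_y suffixes appear and no rename is needed afterwards.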
Upvotes: 0
Reputation: 120559
You can still use my previous answer, but select the desired columns between groupby and transform:
>>> df.fillna(df.groupby(['id', 'col_date'])[['col1', 'col2', 'col3']]
.transform('max'))
id col1 col2 col3 col_date col4 col5
0 1 9.0 0.0 0.55 2023-01-10 NaN NaN
1 1 9.0 0.0 0.55 2023-01-10 1.0 0.0
2 1 9.0 0.0 0.55 2023-01-10 15.0 NaN
3 2 88.0 1.0 0.68 2023-11-22 2.0 1.0
4 2 88.0 1.0 0.68 2023-11-22 103.0 8.0
'max' only works if your dtype is numeric. If that's not the case, you can use a custom function. Replace 'max' with lambda x: x.dropna().squeeze() (when each group has exactly one non-null value per column) or lambda x: x.ffill().bfill():
>>> df.dtypes
id int64
col1 object
col2 float64
col3 object
col_date object
col4 float64
col5 float64
dtype: object
>>> df.fillna(df.groupby(['id', 'col_date'])[['col1', 'col2', 'col3']]
.transform(lambda x: x.ffill().bfill()))
id col1 col2 col3 col_date col4 col5
0 1 9.0 0.0 0.55 2023-01-10 NaN NaN
1 1 9.0 0.0 0.55 2023-01-10 1.0 0.0
2 1 9.0 0.0 0.55 2023-01-10 15.0 NaN
3 2 88.0 1.0 0.68 2023-11-22 2.0 1.0
4 2 88.0 1.0 0.68 2023-11-22 103.0 8.0
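For reference, the ffill/bfill variant above can be run end to end on the question's sample data. This is a sketch: the names keys, fill_cols and filled are illustrative, and col_date is converted with pd.to_datetime only because the question says the column is datetime.

```python
import pandas as pd

data = [
    (1, None, None, None, '2023-01-10', None, None),
    (1, None, None, None, '2023-01-10', 1, 0),
    (1, 9, 0, 0.55, '2023-01-10', 15, None),
    (2, None, None, None, '2023-11-22', 2, 1),
    (2, 88, 1, 0.68, '2023-11-22', 103, 8),
]
df = pd.DataFrame(
    data, columns=['id', 'col1', 'col2', 'col3', 'col_date', 'col4', 'col5']
)
df['col_date'] = pd.to_datetime(df['col_date'])

keys = ['id', 'col_date']             # grouping columns
fill_cols = ['col1', 'col2', 'col3']  # only these columns are filled

# Per group, propagate the known values forward and backward.
group_values = df.groupby(keys)[fill_cols].transform(lambda s: s.ffill().bfill())

# fillna with a DataFrame only touches the columns present in it,
# so col4 and col5 keep their NaN values.
filled = df.fillna(group_values)
print(filled)
```

Restricting the transform to fill_cols is what keeps col4, col5 and any other columns untouched.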
Upvotes: 0