Reputation: 727
Suppose I have a dataframe with rows containing missing data, but a set of columns acting as a key:
import pandas as pd
import numpy as np
data = {"id": [1, 1, 2, 2, 3, 3, 4 ,4], "name": ["John", "John", "Paul", "Paul", "Ringo", "Ringo", "George", "George"], "height": [178, np.nan, 182, np.nan, 175, np.nan, 188, np.nan], "weight": [np.nan, np.NaN, np.nan, 72, np.nan, 68, np.nan, 70]}
df = pd.DataFrame.from_dict(data)
print(df)
   id    name  height  weight
0   1    John   178.0     NaN
1   1    John     NaN     NaN
2   2    Paul   182.0     NaN
3   2    Paul     NaN    72.0
4   3   Ringo   175.0     NaN
5   3   Ringo     NaN    68.0
6   4  George   188.0     NaN
7   4  George     NaN    70.0
How would I go about "squashing" these rows with duplicate keys down to a single row, picking the non-NaN value where one exists?
Desired output:
   id    name  height  weight
0   1    John   178.0     NaN
2   2    Paul   182.0    72.0
4   3   Ringo   175.0    68.0
6   4  George   188.0    70.0
The index doesn't matter, and there is always at most one non-NaN value per column within each group. I think I need to use groupby(['id', 'name']), but I'm not sure where to go from there.
Upvotes: 2
Views: 180
Reputation: 862581
If there is always at most one non-NaN value per group, it is possible to aggregate in many ways:
df = df.groupby(['id', 'name'], as_index=False).first()
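All of these skip NaN when aggregating. With first(), the first non-missing value per column within each group is kept, which reproduces the desired output up to the index:

   id    name  height  weight
0   1    John   178.0     NaN
1   2    Paul   182.0    72.0
2   3   Ringo   175.0    68.0
3   4  George   188.0    70.0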
Or:
df = df.groupby(['id', 'name'], as_index=False).last()
Or:
df = df.groupby(['id', 'name'], as_index=False).mean()
Or:
df = df.groupby(['id', 'name'], as_index=False).sum(min_count=1)
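The min_count=1 matters here: a plain sum() turns an all-NaN group into 0.0 (e.g. John's weight), while min_count=1 requires at least one valid value and keeps NaN otherwise:

print(df.groupby(['id', 'name'], as_index=False)['weight'].sum())
#    id    name  weight
# 0   1    John     0.0
# 1   2    Paul    72.0
# 2   3   Ringo    68.0
# 3   4  George    70.0

print(df.groupby(['id', 'name'], as_index=False)['weight'].sum(min_count=1))
#    id    name  weight
# 0   1    John     NaN
# 1   2    Paul    72.0
# 2   3   Ringo    68.0
# 3   4  George    70.0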
Upvotes: 2