Reputation: 784
Suppose I have this kind of Dataframe:
Data: Lat Long Postal Code
0 41 32 01556
1 32 31 01023
2 31 33 01023
3 NaN NaN 01023
4 33 42 01775
5 40 44 01999
As you can see, rows 1,2,3 have the same postal code. So, in order to fill the NaNs, it would be nice to just use the average of those 2 rows (1,2). How can I generalize this for a large dataset?
Upvotes: 1
Views: 769
Reputation: 23099
IIUC,
groupby
, transform
, fillna()
We first select a slice of our dataframe and use fillna
to only fill missing values, we don't want to overwrite any of the existing data.
we then leverage the groupby
function to group by postal codes as you requested.
we use the transform
method which returns your data with its original index and length.
we assign this to your columns and have the result as below.
df[["Lat", "Long"]] = df[["Lat", "Long"]].fillna(
df.groupby("Postal Code")["Lat", "Long"].transform("mean"))
print(df)
Data Lat Long Postal Code
0 0 41.0 32.0 1556
1 1 32.0 31.0 1023
2 2 31.0 33.0 1023
3 3 31.5 32.0 1023
4 4 33.0 42.0 1775
5 5 40.0 44.0 1999
Upvotes: 2