Reputation: 113
I need to fill NA values in my main data frame based on a second dataframe I created with the `groupby` and `mean` functions. My original dataframe has about 1.5K NaNs to fill, so this needs to be reproducible at a mass scale. I've created a fake dataframe that's a quick-and-dirty imitation of my data using a fake scenario; I can't share my real data with you.
My general idea is:
main_data[
    (main_data["Animal_Type"] == mean_data["Animal_Type"]) &
    (main_data["Cost_Type"] == mean_data["Cost_Type"])
] = main_data["Price"].fillna(mean_data["Price"])
Obviously, that doesn't work, but that's the general gist of how my logic works. I found [this answer][1] but I can't seem to apply it properly to my problem. A lot of answers involve `mask` or assume my data is pretty small, with a single value to replace all my NaNs. I have about 50 different means in my original dataset, each uniquely paired with an "Animal Type" for each "Cost Type". My original data frame is about 30K observations long and full of unique observations too. I can `map`, but that only works for a single column. I'm fairly new to coding, so a lot of the other answers were too complicated for me to understand and alter.
main_data.head(10)
   Pet_ID Animal_Type Cost_Type  Price
0 101 Goat Housing 6.0
1 102 Dog Housing 6.0
2 103 Horse Housing NaN
3 104 Horse Housing 5.0
4 105 Goat Housing 3.0
5 106 Dog Feeding 3.0
6 107 Cat Feeding 6.0
7 108 Horse Housing 6.0
8 109 Hamster Feeding 5.0
9 110 Horse Feeding 3.0
mean_data
Animal_Type Cost_Type Price
0 Cat Feeding 4.500000
1 Cat Housing 5.000000
2 Chicken Feeding 5.000000
3 Chicken Housing 4.500000
4 Dog Feeding 3.000000
5 Dog Housing 6.000000
6 Goat Feeding 5.000000
7 Goat Housing 5.000000
8 Hamster Feeding 5.250000
9 Hamster Housing 3.000000
10 Horse Feeding 3.500000
11 Horse Housing 5.666667
12 Rabit Feeding 3.000000
13 Rabit Housing 3.000000
My Reproducible code:
import random
import numpy as np
import pandas as pd

random.seed(10)
main_data = pd.DataFrame(columns = ["Pet_ID", "Animal_Type", "Cost_Type", "Price", "Cost"])
main_data["Pet_ID"] = pd.Series(list(range(101,150)))
main_data["Animal_Type"] = main_data.Animal_Type.apply(lambda x: random.choice(["Dog", "Cat", "Rabit", "Horse", "Goat", "Chicken", "Hamster"]))
main_data["Cost_Type"] = main_data.Animal_Type.apply(lambda x: random.choice(["Housing", "Feeding"]))
main_data["Price"] = main_data.Price.apply(lambda x: random.choice([3, 5, 6, np.nan]))
main_data["Cost"] = main_data.Cost.apply(lambda x: random.choice([2, 1, 3, np.nan]))
mean_data = main_data.groupby(["Animal_Type", "Cost_Type"])["Price"].mean().reset_index()
Edit: I have put together a solution below, but I wouldn't say it's the most elegant or dependable, and it's probably not the most efficient either.
main_data = pd.merge(
    main_data,
    mean_data,
    on=["Animal_Type", "Cost_Type"],
    how="left"
)
main_data["Price_z"] = main_data["Price_x"].fillna(main_data["Price_y"])
Edit 2: I've added a "Cost" column with NaNs. I don't want this column touched, but I would like to be able to apply the same methodology to it that we're using for the Price column.
[1]: Replace values based on multiple conditions with groupby mean in Pandas
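For reference, one way I could reuse the same methodology on another column later (without touching any column I don't pass in) would be to wrap the group-mean fill in a small helper. fill_na_with_group_mean is just a made-up name, and the grouping columns are assumed to be the same as above:
def fill_na_with_group_mean(df, col, group_cols=("Animal_Type", "Cost_Type")):
    # Replace NaNs in `col` with the mean of `col` within each (Animal_Type, Cost_Type) group
    group_mean = df.groupby(list(group_cols))[col].transform("mean")
    out = df.copy()
    out[col] = out[col].fillna(group_mean)
    return out

main_data = fill_na_with_group_mean(main_data, "Price")
# main_data = fill_na_with_group_mean(main_data, "Cost")  # later, only if I decide to fill Cost too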
Upvotes: 0
Views: 75
Reputation: 25354
> I need to fill NA values in my main data frame based on a second dataframe I created by the `groupby` and `mean` functions.
You don't need that step. You can do this in one step by grouping into multiple dataframes, applying mean on each individual dataframe, and filling NA values within just that dataframe.
So, instead of creating the `mean_data` dataframe, do this:
def fill_by_mean(df):
    df["Price"] = df["Price"].fillna(df["Price"].mean())
    return df

main_data = main_data.groupby(["Animal_Type", "Cost_Type"]).apply(fill_by_mean)
Each individual call to fill_by_mean() sees a dataframe which looks like this:
Pet_ID Animal_Type Cost_Type Price
11 112 Rabit Feeding NaN
34 135 Rabit Feeding 3.0
38 139 Rabit Feeding 3.0
Then it gets the mean of the price column and fills NA values using that. Groupby then concatenates all of the individual dataframes back together.
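One caveat, depending on your pandas version: newer releases may prepend the group keys to the index of the result of apply. If you see that, passing group_keys=False should keep the original row index (same call, sketched below):
# Keep the original index on pandas versions where apply would prepend the group keys
main_data = main_data.groupby(["Animal_Type", "Cost_Type"], group_keys=False).apply(fill_by_mean)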
Upvotes: 1