Emm
Emm

Reputation: 113

How to Fill NaNs in Column of Main Dataframe Based On Conditions Matching Secondary Dataframe of Values to Fill NaNs With Multiple Filler Values

I need to fill NA values in my main data frame based on a second dataframe I created by the groupby and mean functions. My original dataframe has about 1.5K NaNs I need to fill so this needs to reproducible at a mass scale. I've created a fake dataframe that's a short quick and dirty imitation of my data using a fake scenario. I can't share my real data with you.

My general idea is:

main_data[
          (main_data["Animal_Type"] == mean_data["Animal_Type"]) & 
          (main_data["Cost_Type"] == mean_data["Cost_Type"])
         ] = main_data["Price"].fillna(mean_data["Price"])

Obviously, that doesn't work and but that's the general gist of how my logic is working. I found t[his answer][1] but I can't see to apply it properly to my problem. A lot of answers involve mask or assume my data is pretty small with a single value to replace all my NaNs with. I have about 50 different means in my original dataset that are uniquely paired with a "Animal Type" per each "Cost Type". My original data frame is about 30K observations long full of unique observations too. I can map but that's only for a single column. I'm fairly new to coding so a lot of the other answers were too complicated for me too understand and alter too.

main_data

mean_data.head(10)

   **Pet_ID Animal_Type Cost_Type   Price**
0   101     Goat        Housing     6.0
1   102     Dog         Housing     6.0
2   103     Horse       Housing     NaN
3   104     Horse       Housing     5.0
4   105     Goat        Housing     3.0
5   106     Dog         Feeding     3.0
6   107     Cat         Feeding     6.0
7   108     Horse       Housing     6.0
8   109     Hamster     Feeding     5.0
9   110     Horse       Feeding     3.0

mean_data

    Animal_Type Cost_Type   Price
0   Cat         Feeding     4.500000
1   Cat         Housing     5.000000
2   Chicken     Feeding     5.000000
3   Chicken     Housing     4.500000
4   Dog         Feeding     3.000000
5   Dog         Housing     6.000000
6   Goat        Feeding     5.000000
7   Goat        Housing     5.000000
8   Hamster     Feeding     5.250000
9   Hamster     Housing     3.000000
10  Horse       Feeding     3.500000
11  Horse       Housing     5.666667
12  Rabit       Feeding     3.000000
13  Rabit       Housing     3.000000

My Reproducible code:

random.seed(10)

random.seed(10)

main_data = pd.DataFrame(columns = ["Pet_ID", "Animal_Type", "Cost_Type", "Price", "Cost"])

main_data["Pet_ID"] = pd.Series(list(range(101,150)))
main_data["Animal_Type"] = main_data.Animal_Type.apply(lambda x: random.choice(["Dog", "Cat", "Rabit", "Horse", "Goat", "Chicken", "Hamster"])) 
main_data["Cost_Type"] = main_data.Animal_Type.apply(lambda x: random.choice(["Housing", "Feeding"])) 
main_data["Price"] = main_data.Price.apply(lambda x: random.choice([3, 5, 6, np.nan])) 
main_data["Cost"] =  main_data.Cost.apply(lambda x: random.choice([2, 1, 3, np.nan])) 

mean_data = main_data.groupby(["Animal_Type", "Cost_Type"])["Price"].mean().reset_index()

Edit: I have put together two solutions but I wouldn't say it's the more elegant or dependable. Probably not the most efficient too.

main_data = pd.merge(
    main_data,
    mean_data,
    on = ["Animal_Type", "Cost_Type"],
    how = "left"
)

main_data["Price_z"] = main_data["Price_x"].fillna(main_data["Price_y"])

Edit 2: I've added a "Cost" Column with NaNs. I don't want this column touched but would like to use the same methodology with this column we're using for the Price column. [1]: Replace values based on multiple conditions with groupby mean in Pandas

Upvotes: 0

Views: 75

Answers (1)

Nick ODell
Nick ODell

Reputation: 25354

I need to fill NA values in my main data frame based on a second dataframe I created by the groupby and mean functions.

You don't need that step. You can do this in one step by grouping into multiple dataframes, applying mean on each individual dataframe, and filling NA values within just that dataframe.

So, instead of creating the mean_data dataframe, do this:

def fill_by_mean(df):
    df["Price"] = df["Price"].fillna(df["Price"].mean())
    return df

main_data = main_data.groupby(["Animal_Type", "Cost_Type"]).apply(fill_by_mean)

Each individual call to fill_by_mean() sees a dataframe which looks like this:

    Pet_ID Animal_Type Cost_Type  Price
11     112       Rabit   Feeding    NaN
34     135       Rabit   Feeding    3.0
38     139       Rabit   Feeding    3.0

Then it gets the mean of the price column and fills NA values using that. Groupby then concatenates all of the individual dataframes back together.

Upvotes: 1

Related Questions