Reputation: 21

How to replace null values in a data frame with the median of values grouped by two columns?

I have a data frame in Python that records how often each individual eats certain foods in a week. I would like to clean the data frame and replace the null values with the median frequency of each food category for each individual. How can I replace a null value with the median of each food category for each person?

user  ffq  food    food-category
1     1    apple   fruit
1     3    banana  fruit
1     2    tomato  vegetables
1     nan  carrot  vegetables
1     3    potato  vegetables
1     nan  peach   fruit
2     3    apple   fruit
2     nan  banana  fruit
2     2    tomato  vegetables
2     nan  carrot  vegetables
2     3    peach   fruit

The result should look like this:

user  ffq      food    food-category
1     1        apple   fruit
1     3        banana  fruit
1     2        tomato  vegetables
1     **2.5**  carrot  vegetables
1     3        potato  vegetables
1     **2**    peach   fruit
2     3        apple   fruit
2     **3**    banana  fruit
2     2        tomato  vegetables
2     **2**    carrot  vegetables
2     3        peach   fruit

I would appreciate any help.

Upvotes: 1

Answers (2)

Kaymal

Reputation: 586

I guess you want to fill the missing values with the mean of each group instead of the median (for the sample data, the two give the same values). We can use .fillna() together with .groupby() and .transform() to do this in one line of code. First, let's create the DataFrame with the required columns.

import numpy as np
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'user':['1','1','1','1','1','1', '2', '2', '2', '2', '2'],
                   'ffq':[1, 3, 2, np.nan, 3, np.nan, 3, np.nan, 2, np.nan, 3],
                   'food-category':['fruit', 'fruit', 'vegetables', 'vegetables',
                                    'vegetables', 'fruit', 'fruit', 'fruit', 'vegetables',
                                    'vegetables', 'fruit']})

We can now fill the missing values with the desired imputation method, such as the mean, median, or mode. The imputation below uses the mean, which gives the result shown in the question.

# Apply fillna within each (user, food-category) group.
# Selecting 'ffq' keeps the transform on the numeric column only.
df['ffq'] = df.groupby(['user', 'food-category'])['ffq'].transform(lambda x: x.fillna(x.mean()))

    user   ffq   food-category
0   1      1.0   fruit
1   1      3.0   fruit
2   1      2.0   vegetables
3   1      2.5   vegetables
4   1      3.0   vegetables
5   1      2.0   fruit
6   2      3.0   fruit
7   2      3.0   fruit
8   2      2.0   vegetables
9   2      2.0   vegetables
10  2      3.0   fruit

The .transform() method performs a group-specific computation (the mean in this example) and returns an object with the same index as the original. See the pandas User Guide for more information.
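Since the question asks for the median, here is a minimal sketch of the same idea with the median swapped in. It rebuilds the sample data from the question (with integer user ids, which is an assumption about the real data) and uses the illustrative names group_median and filled, which are not from the answer above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "ffq": [1, 3, 2, np.nan, 3, np.nan, 3, np.nan, 2, np.nan, 3],
    "food": ["apple", "banana", "tomato", "carrot", "potato", "peach",
             "apple", "banana", "tomato", "carrot", "peach"],
    "food-category": ["fruit", "fruit", "vegetables", "vegetables", "vegetables",
                      "fruit", "fruit", "fruit", "vegetables", "vegetables", "fruit"],
})

# Median of 'ffq' per (user, food-category), broadcast back to every row.
# NaN values are skipped when the median is computed.
group_median = df.groupby(["user", "food-category"])["ffq"].transform("median")

# Fill only the missing entries; existing values are left untouched.
filled = df["ffq"].fillna(group_median)
print(filled.tolist())
```

If a group has no observed values at all, its median is NaN and those rows stay missing, so a fallback may be needed for such groups.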

Upvotes: 1

pythonic833

Reputation: 3224

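Here is how you can do that. First we sort the values so that they appear in the same order as the groups produced by groupby. Next we calculate the mean of each group, and then we fill the NaNs with the series we extracted. Note that this relies on each (user, food-category) group having exactly one missing value, because the group means are matched to the NaN rows by position.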

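# Sort so that the rows follow the same order as the groupby result
df = df.sort_values(['user','food-category'])
# Mean of the observed 'ffq' values per (user, food-category) group
srs = df.dropna(subset=['ffq']).groupby(['user','food-category']).agg({'ffq':'mean'})['ffq']
# Re-label the group means with the index of the NaN rows (one NaN per group)
srs.index = df[df['ffq'].isnull()].index
df['ffq'] = df['ffq'].fillna(value=srs)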

Result:

df.sort_index()
    user    ffq     food    food-category
0   1       1.0     apple   fruit
1   1       3.0     banana  fruit
2   1       2.0     tomato  vegetables
3   1       2.5     carrot  vegetables
4   1       3.0     potato  vegetables
5   1       2.0     peach   fruit
6   2       3.0     apple   fruit
7   2       3.0     banana  fruit
8   2       2.0     tomato  vegetables
9   2       2.0     carrot  vegetables
10  2       3.0     peach   fruit
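For groups with more than one NaN, a lookup keyed on the group columns may be safer than aligning by position. The following is only a sketch of that variation, not the method above: it uses the median (as the question asks), rebuilds the question's sample data with integer user ids, and introduces the hypothetical names keys, medians and ffq_median.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "ffq": [1, 3, 2, np.nan, 3, np.nan, 3, np.nan, 2, np.nan, 3],
    "food": ["apple", "banana", "tomato", "carrot", "potato", "peach",
             "apple", "banana", "tomato", "carrot", "peach"],
    "food-category": ["fruit", "fruit", "vegetables", "vegetables", "vegetables",
                      "fruit", "fruit", "fruit", "vegetables", "vegetables", "fruit"],
})

keys = ["user", "food-category"]

# One median per group, indexed by (user, food-category).
medians = df.groupby(keys)["ffq"].median().rename("ffq_median")

# Attach each row's group median by matching on the key columns,
# so no sorting or positional alignment is needed.
df = df.join(medians, on=keys)

# Fill the gaps, then drop the helper column.
df["ffq"] = df["ffq"].fillna(df["ffq_median"])
df = df.drop(columns="ffq_median")
print(df)
```

Because the match is made on the key columns, the row order and the number of NaNs per group do not matter here.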

Upvotes: 0
