Reputation: 21
I have a data frame in Python that records how often individuals eat certain foods in a week. I would like to clean the data frame and replace null values with the median frequency of each food category for each individual. How can I replace the null values with the median of each food category for each person?
user ffq food food-category
1 1 apple fruit
1 3 banana fruit
1 2 tomato vegetables
1 nan carrot vegetables
1 3 potato vegetables
1 nan peach fruit
2 3 apple fruit
2 nan banana fruit
2 2 tomato vegetables
2 nan carrot vegetables
2 3 peach fruit
The result should look like this:
user ffq food food-category
1 1 apple fruit
1 3 banana fruit
1 2 tomato vegetables
1 **2.5** carrot vegetables
1 3 potato vegetables
1 **2** peach fruit
2 3 apple fruit
2 **3** banana fruit
2 2 tomato vegetables
2 **2** carrot vegetables
2 3 peach fruit
I would appreciate any help.
Upvotes: 1
Views: 375
Reputation: 586
I guess you want to fill the missing values with the mean of the groups instead of the median. We can use .fillna() along with .groupby() and .transform() to accomplish this with one line of code. First, let's create the DataFrame with the required columns.
# Create a DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({'user':['1','1','1','1','1','1', '2', '2', '2', '2', '2'],
                   'ffq':[1, 3, 2, np.nan, 3, np.nan, 3, np.nan, 2, np.nan, 3],
                   'food-category':['fruit', 'fruit', 'vegetables', 'vegetables',
                                    'vegetables', 'fruit', 'fruit', 'fruit', 'vegetables',
                                    'vegetables', 'fruit']})
We can now fill the missing values with the desired imputation method, such as the mean, median or mode. The imputation below uses the mean, which gives the result shown in the question.
# Apply fillna within each group; selecting 'ffq' keeps other columns out of the transform
df['ffq'] = df.groupby(['user', 'food-category'])['ffq'].transform(lambda x: x.fillna(x.mean()))
user ffq food-category
0 1 1.0 fruit
1 1 3.0 fruit
2 1 2.0 vegetables
3 1 2.5 vegetables
4 1 3.0 vegetables
5 1 2.0 fruit
6 2 3.0 fruit
7 2 3.0 fruit
8 2 2.0 vegetables
9 2 2.0 vegetables
10 2 3.0 fruit
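Since the question asks for the median, here is a minimal sketch of the same pattern with the median swapped in. On this sample the two give identical results, because no group has more than two observed values. The `food` column is taken from the question's table, and the output column name `ffq_filled` is only an illustrative choice so the original values stay visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['1'] * 6 + ['2'] * 5,
    'ffq': [1, 3, 2, np.nan, 3, np.nan, 3, np.nan, 2, np.nan, 3],
    'food': ['apple', 'banana', 'tomato', 'carrot', 'potato', 'peach',
             'apple', 'banana', 'tomato', 'carrot', 'peach'],
    'food-category': ['fruit', 'fruit', 'vegetables', 'vegetables',
                      'vegetables', 'fruit', 'fruit', 'fruit', 'vegetables',
                      'vegetables', 'fruit'],
})

# Median of the observed (non-NaN) values in each (user, food-category) group,
# broadcast back to the original row order.
group_median = df.groupby(['user', 'food-category'])['ffq'].transform('median')

# Only the missing entries are replaced; observed values are left untouched.
# 'ffq_filled' is a hypothetical column name used for illustration.
df['ffq_filled'] = df['ffq'].fillna(group_median)
```

Passing the string `'median'` instead of a lambda lets pandas use its built-in group median, and a group with no observed values at all would simply stay NaN.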
The .transform() method performs a group-specific computation (the mean in this example) and returns a like-indexed object. See the User Guide for more information.
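To make "like-indexed" concrete, the following small sketch contrasts a plain group reduction with .transform() on a toy frame. The frame and the names `toy`, `reduced` and `broadcast` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Toy data (hypothetical): two groups, 'a' with two rows and 'b' with one.
toy = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, 3.0, 10.0]})

# A plain reduction returns one value per group, indexed by the group key.
reduced = toy.groupby('g')['v'].mean()

# transform returns one value per original row, carrying the original index,
# so it can be assigned back or passed to fillna without realignment.
broadcast = toy.groupby('g')['v'].transform('mean')
```

Because `broadcast` shares the index of `toy`, it lines up row by row with the column it was computed from, which is what makes the one-line fill possible.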
Upvotes: 1
Reputation: 3224
Here is how you do that. First, sort the values so that the rows appear in the same order as the groups produced by groupby. Next, calculate the mean of each group, and finally fill the NaNs with the resulting series.
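# Sort so the rows follow the same (user, food-category) order as the groupby result
df = df.sort_values(['user','food-category'])
# Mean of the observed values per (user, food-category) group
srs = df.dropna().groupby(['user','food-category']).agg({'ffq':'mean'})['ffq']
# Re-label the group means with the row labels of the missing entries
# (this relies on each group having exactly one NaN, as in this sample)
srs.index = df[df['ffq'].isnull()].index
df['ffq'] = df['ffq'].fillna(value=srs)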
Result:
df.sort_index()
user ffq food food-category
0 1 1.0 apple fruit
1 1 3.0 banana fruit
2 1 2.0 tomato vegetables
3 1 2.5 carrot vegetables
4 1 3.0 potato vegetables
5 1 2.0 peach fruit
6 2 3.0 apple fruit
7 2 3.0 banana fruit
8 2 2.0 tomato vegetables
9 2 2.0 carrot vegetables
10 2 3.0 peach fruit
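The index relabelling above lines up because every group in the sample has exactly one missing value. As a hedged alternative that does not depend on that, the group means can be joined back on the key columns. This is a sketch on made-up data, not the approach from this answer, and the helper column `group_mean` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up data: user '1' has two NaNs in one group, user '2' has none.
df = pd.DataFrame({
    'user': ['1', '1', '1', '1', '2', '2'],
    'ffq': [2.0, np.nan, 4.0, np.nan, 5.0, 7.0],
    'food-category': ['fruit'] * 6,
})

# Mean of the observed values per group (NaNs are skipped by default).
means = (df.groupby(['user', 'food-category'])['ffq']
           .mean()
           .rename('group_mean'))

# Attach each row's group mean by matching the key columns against the
# MultiIndex of `means`, then fill only the missing entries.
filled = df.join(means, on=['user', 'food-category'])
filled['ffq'] = filled['ffq'].fillna(filled['group_mean'])
filled = filled.drop(columns='group_mean')
```

Matching on the keys rather than on row position means no sorting is needed, and any number of missing values per group is handled.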
Upvotes: 0