Reputation: 455
I have a dataframe like below.
user cat val
u1 cat1 1
u1 cat2 2
u1 cat3 3
u2 cat1 5
u3 cat4 4
And wish to transform it like this.
user cat val(avg)
u1 cat1 1
u1 cat2 2
u1 cat3 3
u1 cat4 4
u2 cat1 5
u2 cat2 2
u2 cat3 3
u2 cat4 4
u3 cat1 3
u3 cat2 2
u3 cat3 3
u3 cat4 4
There are also a couple of numeric columns to the right of cat, which hopefully can be filled with either NA; or if possible averaged out.
Upvotes: 4
Views: 3193
Reputation: 28644
You could use the complete function from pyjanitor to expose the missing values :
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.complete('user', 'cat')
user cat val
0 u1 cat1 1.0
1 u1 cat2 2.0
2 u1 cat3 3.0
3 u1 cat4 NaN
4 u2 cat1 5.0
5 u2 cat2 NaN
6 u2 cat3 NaN
7 u2 cat4 NaN
8 u3 cat1 NaN
9 u3 cat2 NaN
10 u3 cat3 NaN
11 u3 cat4 4.0
Upvotes: 1
Reputation: 59529
You want to basically reindex your DataFrame
.
from itertools import product
idx = list(product(df['user'].unique(), df['cat'].unique()))
df = df.set_index(['user', 'cat']).reindex(idx).reset_index()
The second part seems to be setting the missing values (in this case NaN
after the reindex) to the average value from that category. You can accomplish this with a mapping and groupby
to calculate the average.
df.loc[df.val.isnull(), 'val'] = df.loc[df.val.isnull(), 'cat'].map(df.groupby('cat').val.mean())
df
is now:
user cat val
0 u1 cat1 1.0
1 u1 cat2 2.0
2 u1 cat3 3.0
3 u1 cat4 4.0
4 u2 cat1 5.0
5 u2 cat2 2.0
6 u2 cat3 3.0
7 u2 cat4 4.0
8 u3 cat1 3.0
9 u3 cat2 2.0
10 u3 cat3 3.0
11 u3 cat4 4.0
Upvotes: 1
Reputation: 11192
one way to solve this,
l1=df['user'].unique().tolist()
l2=df['cat'].unique().tolist()
new_df = pd.DataFrame(list(itertools.product(l1,l2))).rename(columns={0:'user',1:'cat'})
new_df=pd.merge(new_df,df,on=['user','cat'],how='left')
Output:
user cat val
0 u1 cat1 1.0
1 u1 cat2 2.0
2 u1 cat3 3.0
3 u1 cat4 NaN
4 u2 cat1 5.0
5 u2 cat2 NaN
6 u2 cat3 NaN
7 u2 cat4 NaN
8 u3 cat1 NaN
9 u3 cat2 NaN
10 u3 cat3 NaN
11 u3 cat4 4.0
Upvotes: 5
Reputation: 555
What you want is a cross join between the unique values of each column
You can do this creating two dataframes with unique values from each
df_col1 = pd.DataFrame(df['user'].unique(), columns=['user'])
df_col2 = pd.DataFrame(df['cat'].unique(), columns=['cat'])
# This step will make a dummy var in each dataframe which will be used to
# merge the dataframes
merged = df_col1.assign(key=1).merge(df_col2.assign(key=1)).drop(['key'], 1)
Upvotes: 0