Reputation: 2243
Here's my dataframe:
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 2 ....
I want to make sure that every user1-user2
pair has a row corresponding to each category (there are three: 0,1,2). If not, I want to insert a row, and set the other columns to zero.
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 1 <SET ALL TO ZERO>
Alice Carol 2 ....
what I have so far is the list of all user1-user2
which has less than 3 values for cat
:
df.groupby(['user1','user2']).agg({'cat':'count'}).reset_index()[['user1','user2']]
I could iterate over these users, but that will take a long time (there are >1M such pairs). I've checked at other solutions for inserting rows in pandas based on some condition (like Pandas/Python adding row based on condition and Insert row in Pandas Dataframe based on a condition) but they're not exactly the same.
Also, since this is a huge dataset, the solution has to be vectorized. How should I proceed?
Upvotes: 2
Views: 127
Reputation: 862581
Use set_index
with reindex
by MultiIndex.from_product
:
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 2 3 4
df = df.set_index(['user1','user2', 'cat'])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 1 0 0
5 Alice Carol 2 3 4
Another solution is create new Dataframe
by all combinations of unique values of columns and merge
with right
join:
from itertools import product
df1 = pd.DataFrame(list(product(df['user1'].unique(),
df['user2'].unique(),
df['cat'].unique())), columns=['user1','user2', 'cat'])
df = df.merge(df1, how='right').fillna(0)
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2.0 4.0
1 Alice Bob 1 3.0 4.0
2 Alice Bob 2 4.0 4.0
3 Alice Carol 0 6.0 4.0
4 Alice Carol 2 3.0 4.0
5 Alice Carol 1 0.0 0.0
EDIT2:
df['user1'] = df['user1'] + '_' + df['user2']
df = df.set_index(['user1', 'cat']).drop('user2', 1)
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
df[['user1','user2']] = df['user1'].str.split('_', expand=True)
print (df)
user1 cat quantity a user2
0 Alice 0 2 4 Bob
1 Alice 1 3 4 Bob
2 Alice 2 4 4 Bob
3 Alice 0 6 4 Carol
4 Alice 1 0 0 Carol
5 Alice 2 3 4 Carol
EDIT3:
cols = df.columns.difference(['user1','user2'])
df = (df.groupby(['user1','user2'])[cols]
.apply(lambda x: x.set_index('cat').reindex(df['cat'].unique(), fill_value=0))
.reset_index())
print (df)
user1 user2 cat a quantity
0 Alice Bob 0 4 2
1 Alice Bob 1 4 3
2 Alice Bob 2 4 4
3 Alice Carol 0 4 6
4 Alice Carol 1 0 0
5 Alice Carol 2 4 3
Upvotes: 2