Reputation: 5696
I tried to answer this question by a group-level merging. The below is a slightly modified version of the same question, but I need the output by a group-level merging.
Here are the input dataframes:
df = pd.DataFrame({ "group":[1,1,1 ,2,2],
"cat": ['a', 'b', 'c', 'a', 'c'] ,
"value": range(5),
"value2": np.array(range(5))* 2})
df
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
a 2 3 6
c 2 4 8
categories = ['a', 'b', 'c', 'd']
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
print(categories)
cat
0 a
1 b
2 c
3 d
Here's the expected output:
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
d NA NA NA
a 2 3 6
c 2 4 8
b NA NA NA
d NA NA NA
Question:
I can achieve what I want by a for loop. Is there a pandas way to do that though?
(I need to perform an outer join between categories
and each group of the groupby result of df.groupby('group')
)
grouped = df.groupby('group')
merged_list = []
for g in grouped:
merged = pd.merge(categories, g[1], how = 'outer', on='cat')
merged_list.append(merged)
out = pd.concat(merged_list)
Upvotes: 4
Views: 8625
Reputation: 862641
I think groupby
+ merge
here is only overcomplicated way for this.
So faster is use reindex
by MultiIndex
:
mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
df = df.set_index(['group','cat']).reindex(mux).swaplevel(0,1).reset_index()
#add missing values to group column
df['group'] = df['group'].mask(df['value'].isnull())
print (df)
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
4 a 2.0 3.0 6.0
5 b NaN NaN NaN
6 c 2.0 4.0 8.0
7 d NaN NaN NaN
Possible solution:
df = df.groupby('group', group_keys=False)
.apply(lambda x: pd.merge(categories, x, how = 'outer', on='cat'))
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
0 a 2.0 3.0 6.0
1 b NaN NaN NaN
2 c 2.0 4.0 8.0
3 d NaN NaN NaN
Timings:
np.random.seed(123)
N = 1000000
L = list('abcd') #235,94.1,156ms
df = pd.DataFrame({'cat': np.random.choice(L, N, p=(0.002,0.002,0.005, 0.991)),
'group':np.random.randint(10000,size=N),
'value':np.random.randint(1000,size=N),
'value2':np.random.randint(5000,size=N)})
df = df.sort_values(['group','cat']).drop_duplicates(['group','cat']).reset_index(drop=True)
print (df.head(10))
categories = ['a', 'b', 'c', 'd']
def jez1(df):
mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
df = df.set_index(['group','cat']).reindex(mux, fill_value=0).swaplevel(0,1).reset_index()
df['group'] = df['group'].mask(df['value'].isnull())
return df
def jez2(df):
grouped = df.groupby('group')
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
return grouped.apply(lambda x: pd.merge(categories, x, how = 'outer', on='cat'))
def coldspeed(df):
grouped = df.groupby('group')
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
return pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])
def akilat90(df):
grouped = df.groupby('group')
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
merged_list = []
for g in grouped:
merged = pd.merge(categories, g[1], how = 'outer', on='cat')
merged['group'].fillna(merged['group'].mode()[0],inplace=True) # replace the `group` column's `NA`s by mode
merged.fillna(0, inplace=True)
merged_list.append(merged)
return pd.concat(merged_list)
In [471]: %timeit jez1(df)
100 loops, best of 3: 12 ms per loop
In [472]: %timeit jez2(df)
1 loop, best of 3: 14.5 s per loop
In [473]: %timeit coldspeed(df)
1 loop, best of 3: 19.4 s per loop
In [474]: %timeit akilat90(df)
1 loop, best of 3: 22.3 s per loop
Upvotes: 4
Reputation: 402493
To actually answer your question, no - you can only merge 2 dataframes at a time (I'm not aware of multi-way merges in pandas). You cannot avoid the loop, but you certainly can make your code a little neater.
pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
0 a 2.0 3.0 6.0
1 c 2.0 4.0 8.0
2 b NaN NaN NaN
3 d NaN NaN NaN
Upvotes: 0