Reputation: 1406
there are two DataFrame, left and right.
left = pd.DataFrame({'C': ['C0', 'C1'],
'D': ['D0', 'D1']},
index=['K0', 'K1'])
right = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key': ['K0', 'K1', 'K0', 'K1']})
left:
right:
in the left frame keys are "k0, k1" that are index too, and are unique. in the right frame keys are "k0, k1, k0, k1", they can occure any number of time. the question is: how join those frames that the result be:
Upvotes: 3
Views: 84
Reputation: 61947
Here is a solution that puts all values for the keys into a list within a Series and then converts that to a DataFrame. It also faster for bigger data.
right_flat = right.set_index('key').groupby(level='key').apply(lambda x: x.values.flatten().tolist())
left_flat = pd.Series(left.T.to_dict('list'))
all_flat = left_flat + right_flat
df_final = pd.concat([pd.Series(v, name=idx) for idx, v in all_flat.iteritems()], axis=1).T
Make bigger data
right = right.sample(100000, replace=True)
%%timeit
right_flat = right.set_index('key').groupby(level='key').apply(lambda x: x.values.flatten().tolist())
left_flat = pd.Series(left.T.to_dict('list'))
all_flat = left_flat + right_flat
df_final = pd.concat([pd.Series(v, name=idx) for idx, v in all_flat.iteritems()], axis=1).T
100 loops, best of 3: 29.1 ms per loop
%%timeit
g = right.groupby('key').cumcount()
right1 = right.groupby([right.key, g]).first().unstack().sort_index(axis=1, level=1)
right1.columns = right1.columns.droplevel(0)
df = pd.concat([left, right1], axis=1, join='inner')
df.columns = np.arange(len(df.columns))
df.index.name = None
10 loops, best of 3: 96 ms per loop
Upvotes: 1
Reputation: 862581
First need reshape right
by pivot_table
:
right1 = right.pivot_table(index='key',
columns=right.groupby('key').cumcount(),
aggfunc='first').sort_index(axis=1, level=1)
right1.columns = right1.columns.droplevel(0)
print (right1)
0 0 1 1
key
K0 A0 B0 A2 B2
K1 A1 B1 A3 B3
Another solution for right1
is with groupby
is with GroupBy.first
:
g = right.groupby('key').cumcount()
right1 = right.groupby([right.key, g]).first().unstack().sort_index(axis=1, level=1)
right1.columns = right1.columns.droplevel(0)
print (right1)
0 0 1 1
key
K0 A0 B0 A2 B2
K1 A1 B1 A3 B3
And last use concat
with inner
join or merge
too (how='inner' can be omit, because default parameter):
df = pd.concat([left, right1], axis=1, join='inner')
#assign default column names (0,1,2...)
df.columns = np.arange(len(df.columns))
df.index.name = None
print (df)
0 1 2 3 4 5
K0 C0 D0 A0 B0 A2 B2
K1 C1 D1 A1 B1 A3 B3
df1 = pd.merge(left, right1, left_index=True, right_index=True)
#assign default column names (0,1,2...)
df1.columns = np.arange(len(df1.columns))
print (df1)
0 1 2 3 4 5
K0 C0 D0 A0 B0 A2 B2
K1 C1 D1 A1 B1 A3 B3
Upvotes: 3