pd shah
pd shah

Reputation: 1406

Joining key columns on an index

there are two DataFrame, left and right.

left = pd.DataFrame({'C': ['C0', 'C1'],
                     'D': ['D0', 'D1']},
                     index=['K0', 'K1'])

right = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3'],
                      'key': ['K0', 'K1', 'K0', 'K1']})

left:

enter image description here

right:

enter image description here

in the left frame keys are "k0, k1" that are index too, and are unique. in the right frame keys are "k0, k1, k0, k1", they can occure any number of time. the question is: how join those frames that the result be:

enter image description here

Upvotes: 3

Views: 84

Answers (2)

Ted Petrou
Ted Petrou

Reputation: 61947

Here is a solution that puts all values for the keys into a list within a Series and then converts that to a DataFrame. It also faster for bigger data.

right_flat = right.set_index('key').groupby(level='key').apply(lambda x: x.values.flatten().tolist())
left_flat = pd.Series(left.T.to_dict('list'))
all_flat = left_flat + right_flat
df_final = pd.concat([pd.Series(v, name=idx) for idx, v in all_flat.iteritems()], axis=1).T

Speed Test

Make bigger data

right = right.sample(100000, replace=True)

%%timeit
right_flat = right.set_index('key').groupby(level='key').apply(lambda x: x.values.flatten().tolist())
left_flat = pd.Series(left.T.to_dict('list'))
all_flat = left_flat + right_flat
df_final = pd.concat([pd.Series(v, name=idx) for idx, v in all_flat.iteritems()], axis=1).T

100 loops, best of 3: 29.1 ms per loop

%%timeit
g = right.groupby('key').cumcount()
right1 = right.groupby([right.key, g]).first().unstack().sort_index(axis=1, level=1)
right1.columns = right1.columns.droplevel(0)
df = pd.concat([left, right1], axis=1, join='inner')
df.columns = np.arange(len(df.columns))
df.index.name = None

10 loops, best of 3: 96 ms per loop

Upvotes: 1

jezrael
jezrael

Reputation: 862581

First need reshape right by pivot_table:

right1 = right.pivot_table(index='key', 
                           columns=right.groupby('key').cumcount(),
                           aggfunc='first').sort_index(axis=1, level=1)
right1.columns = right1.columns.droplevel(0)
print (right1)
      0   0   1   1
key                
K0   A0  B0  A2  B2
K1   A1  B1  A3  B3

Another solution for right1 is with groupby is with GroupBy.first:

g = right.groupby('key').cumcount()
right1 = right.groupby([right.key, g]).first().unstack().sort_index(axis=1, level=1)
right1.columns = right1.columns.droplevel(0)
print (right1)
      0   0   1   1
key                
K0   A0  B0  A2  B2
K1   A1  B1  A3  B3

And last use concat with inner join or merge too (how='inner' can be omit, because default parameter):

df = pd.concat([left, right1], axis=1, join='inner')
#assign default column names (0,1,2...)
df.columns = np.arange(len(df.columns))
df.index.name = None
print (df)
     0   1   2   3   4   5
K0  C0  D0  A0  B0  A2  B2
K1  C1  D1  A1  B1  A3  B3

df1 = pd.merge(left, right1, left_index=True, right_index=True)
#assign default column names (0,1,2...)
df1.columns = np.arange(len(df1.columns))
print (df1)
     0   1   2   3   4   5
K0  C0  D0  A0  B0  A2  B2
K1  C1  D1  A1  B1  A3  B3

Upvotes: 3

Related Questions