Cleb
Cleb

Reputation: 26027

Efficient concatenation ignoring suffixes in index

I do have two dataframes like this:

import pandas as pd

df1 = pd.DataFrame({"c1": range(5), "c2": range(1, 6)}, index=list("ABCDE"))
df2 = pd.DataFrame({"c3": range(15, 21), "c4": range(11, 17)}, index=["A_suf1", "B_suf2", "A_suf2", "C_suf2", "B_suf1", "D_suf1"])   

   c1  c2    
A   0   1
B   1   2
C   2   3
D   3   4
E   4   5

        c3  c4
A_suf1  15  11
B_suf2  16  12
A_suf2  17  13
C_suf2  18  14
B_suf1  19  15
D_suf1  20  16

which I want to turn into

       c3  c4  c1
A_suf1  15  11   0
B_suf2  16  12   1
A_suf2  17  13   0
C_suf2  18  14   2
B_suf1  19  15   1
D_suf1  20  16   3

So, I want to concatenate the dataframes independent of the suffixes suf1 and suf2 in the index of df2. As the entry in c1 in df1 for column A is 0, it should then appear in the concatenated dataframe for the entries A_suf1 and A_suf2.

I currently implement this as follows:

# store original name of index
old_index = df2.index
# temporary column which creates values which are in the the index of df1
df2['helper'] = df2.reset_index()["index"].apply(lambda x: x.split("_")[0]).tolist()
# prepare concat
df2 = df2.set_index("helper")
# concat
df_final = pd.concat([df2, df1.loc[:, "c1"]], join="inner", axis=1)
# reset index to original values
df_final.index = old_index

which gives me the desired output.

However, it requires a rather slow apply and it will also fail if there is an index entry in df2 which is not in df1. For example, the above code will fail for df2 equals

df2 = pd.DataFrame({"c3": range(15, 22), "c4": range(11, 18)}, index=["A_suf1", "B_suf2", "A_suf2", "C_suf2", "B_suf1", "D_suf1", "F_suf1"])

        c3  c4
A_suf1  15  11
B_suf2  16  12
A_suf2  17  13
C_suf2  18  14
B_suf1  19  15
D_suf1  20  16
F_suf1  21  17

Question is therefore whether there is a more efficient and general solution out there for the working and also non-working case, respectively.

Upvotes: 2

Views: 71

Answers (2)

cs95
cs95

Reputation: 402952

I don't think you need to convert to a MultiIndex. You can just extract the indices and use df.loc with pd.concat to join.

idx = [x[0] for x in df2.index]
out = pd.concat([df2, df1.loc[idx, ['c1']].set_index(df2.index)], 1)
print(out)
        c3  c4  c1
A_suf1  15  11   0
B_suf2  16  12   1
A_suf2  17  13   0
C_suf2  18  14   2
B_suf1  19  15   1
D_suf1  20  16   3

For your second dataframe, I get

        c3  c4   c1
A_suf1  15  11  0.0
B_suf2  16  12  1.0
A_suf2  17  13  0.0
C_suf2  18  14  2.0
B_suf1  19  15  1.0
D_suf1  20  16  3.0
F_suf1  21  17  NaN

Performance

Small

# converting to MultiIndex
100 loops, best of 3: 3.57 ms per loop

# using `df.loc` and `df.set_index`
1000 loops, best of 3: 1.53 ms per loop

Upvotes: 1

jezrael
jezrael

Reputation: 863351

Use join by MultiIndex created splitted index of df2 with reindex for Multiindex of df1:

df2.index = df2.index.str.split('_', expand=True)
print(df2)
        c3  c4
A suf1  15  11
B suf2  16  12
A suf2  17  13
C suf2  18  14
B suf1  19  15
D suf1  20  16

print (df1['c1'].reindex(df2.index,level=0))
A  suf1    0
B  suf2    1
A  suf2    0
C  suf2    2
B  suf1    1
D  suf1    3
Name: c1, dtype: int32

df = df2.join(df1['c1'].reindex(df2.index,level=0))
#convert MultiIndex to index
df.index = df.index.map('_'.join)
print (df)
        c3  c4  c1
A_suf1  15  11   0
B_suf2  16  12   1
A_suf2  17  13   0
C_suf2  18  14   2
B_suf1  19  15   1
D_suf1  20  16   3

For another df2 get:

print (df)
        c3  c4   c1
A_suf1  15  11  0.0
B_suf2  16  12  1.0
A_suf2  17  13  0.0
C_suf2  18  14  2.0
B_suf1  19  15  1.0
D_suf1  20  16  3.0
F_suf1  21  17  NaN

Upvotes: 3

Related Questions