Reputation: 26027
I do have two dataframes like this:
import pandas as pd
df1 = pd.DataFrame({"c1": range(5), "c2": range(1, 6)}, index=list("ABCDE"))
df2 = pd.DataFrame({"c3": range(15, 21), "c4": range(11, 17)}, index=["A_suf1", "B_suf2", "A_suf2", "C_suf2", "B_suf1", "D_suf1"])
c1 c2
A 0 1
B 1 2
C 2 3
D 3 4
E 4 5
c3 c4
A_suf1 15 11
B_suf2 16 12
A_suf2 17 13
C_suf2 18 14
B_suf1 19 15
D_suf1 20 16
which I want to turn into
c3 c4 c1
A_suf1 15 11 0
B_suf2 16 12 1
A_suf2 17 13 0
C_suf2 18 14 2
B_suf1 19 15 1
D_suf1 20 16 3
So, I want to concatenate the dataframes independent of the suffixes suf1
and suf2
in the index of df2
. As the entry in c1
in df1
for column A
is 0, it should then appear in the concatenated dataframe for the entries A_suf1
and A_suf2
.
I currently implement this as follows:
# store original name of index
old_index = df2.index
# temporary column which creates values which are in the the index of df1
df2['helper'] = df2.reset_index()["index"].apply(lambda x: x.split("_")[0]).tolist()
# prepare concat
df2 = df2.set_index("helper")
# concat
df_final = pd.concat([df2, df1.loc[:, "c1"]], join="inner", axis=1)
# reset index to original values
df_final.index = old_index
which gives me the desired output.
However, it requires a rather slow apply
and it will also fail if there is an index entry in df2
which is not in df1
. For example, the above code will fail for df2
equals
df2 = pd.DataFrame({"c3": range(15, 22), "c4": range(11, 18)}, index=["A_suf1", "B_suf2", "A_suf2", "C_suf2", "B_suf1", "D_suf1", "F_suf1"])
c3 c4
A_suf1 15 11
B_suf2 16 12
A_suf2 17 13
C_suf2 18 14
B_suf1 19 15
D_suf1 20 16
F_suf1 21 17
Question is therefore whether there is a more efficient and general solution out there for the working and also non-working case, respectively.
Upvotes: 2
Views: 71
Reputation: 402952
I don't think you need to convert to a MultiIndex
. You can just extract the indices and use df.loc
with pd.concat
to join.
idx = [x[0] for x in df2.index]
out = pd.concat([df2, df1.loc[idx, ['c1']].set_index(df2.index)], 1)
print(out)
c3 c4 c1
A_suf1 15 11 0
B_suf2 16 12 1
A_suf2 17 13 0
C_suf2 18 14 2
B_suf1 19 15 1
D_suf1 20 16 3
For your second dataframe, I get
c3 c4 c1
A_suf1 15 11 0.0
B_suf2 16 12 1.0
A_suf2 17 13 0.0
C_suf2 18 14 2.0
B_suf1 19 15 1.0
D_suf1 20 16 3.0
F_suf1 21 17 NaN
Performance
# converting to MultiIndex
100 loops, best of 3: 3.57 ms per loop
# using `df.loc` and `df.set_index`
1000 loops, best of 3: 1.53 ms per loop
Upvotes: 1
Reputation: 863351
Use join
by MultiIndex
created splitted index
of df2
with reindex
for Multiindex
of df1
:
df2.index = df2.index.str.split('_', expand=True)
print(df2)
c3 c4
A suf1 15 11
B suf2 16 12
A suf2 17 13
C suf2 18 14
B suf1 19 15
D suf1 20 16
print (df1['c1'].reindex(df2.index,level=0))
A suf1 0
B suf2 1
A suf2 0
C suf2 2
B suf1 1
D suf1 3
Name: c1, dtype: int32
df = df2.join(df1['c1'].reindex(df2.index,level=0))
#convert MultiIndex to index
df.index = df.index.map('_'.join)
print (df)
c3 c4 c1
A_suf1 15 11 0
B_suf2 16 12 1
A_suf2 17 13 0
C_suf2 18 14 2
B_suf1 19 15 1
D_suf1 20 16 3
For another df2
get:
print (df)
c3 c4 c1
A_suf1 15 11 0.0
B_suf2 16 12 1.0
A_suf2 17 13 0.0
C_suf2 18 14 2.0
B_suf1 19 15 1.0
D_suf1 20 16 3.0
F_suf1 21 17 NaN
Upvotes: 3