FunPlus
FunPlus

Reputation: 97

How to merge two dataframes according to their indexes?

I am trying to use Pandas for data analysis. I need to merge two dataframes according to their indexes. However, their indexes are totally different. The rule is that if the index of df2 is the substring of df1, then I should merge them. For example, df1.index == ['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'], and df2.index == ['bb/bbb', 'ccc', 'hello']. Then df1 and df2 have two indexes in common, we should merge them based on these indexes. What should i do?

Upvotes: 0

Views: 338

Answers (2)

tlentali
tlentali

Reputation: 3455

Having your DataFrame :

>>> df1 = pd.DataFrame({'col_a': [1, 2, 3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
>>> df2 = pd.DataFrame({'col_b': [4, 5, 6]}, index=['bb/bbb', 'ccc', 'hello'])

And changing the index to column :

>>> df1=df1.reset_index(drop=False)
>>> df1 = df1.rename(columns={'index': 'value_df1'})
>>> df1
    value_df1   col_a
0   a/aa/aaa    1
1   b/bb/bbb    2
2   c/cc/ccc    3

>>> df2=df2.reset_index(drop=False)
>>> df2 = df2.rename(columns={'index': 'value_df2'})
>>> df2
    value_df2       col_b
0   bb/bbb          4
1   ccc             5
2   hello           6

We merge both DataFrame on the joincolumn :

>>> df1['join'] = 1
>>> df2['join'] = 1
>>> dfFull = df1.merge(df2, on='join').drop('join', axis=1)
>>> dfFull
    value_df1   col_a   value_df2       col_b
0   a/aa/aaa    1       bb/bbb          4
1   a/aa/aaa    1       ccc             5
2   a/aa/aaa    1       hello           6
3   b/bb/bbb    2       bb/bbb          4
4   b/bb/bbb    2       ccc             5
5   b/bb/bbb    2       hello           6
6   c/cc/ccc    3       bb/bbb          4
7   c/cc/ccc    3       ccc             5
8   c/cc/ccc    3       hello           6

Then we use an apply to match the initial index value :

>>> df2.drop('join', axis=1, inplace=True)
>>> dfFull['match'] = dfFull.apply(lambda x: x['value_df1'].find(x['value_df2']), axis=1).ge(0)
>>> dfFull
    value_df1   col_a   value_df2       col_b   match
0   a/aa/aaa    1       bb/bbb          4       False
1   a/aa/aaa    1       ccc             5       False
2   a/aa/aaa    1       hello           6       False
3   b/bb/bbb    2       bb/bbb          4       True
4   b/bb/bbb    2       ccc             5       False
5   b/bb/bbb    2       hello           6       False
6   c/cc/ccc    3       bb/bbb          4       False
7   c/cc/ccc    3       ccc             5       True
8   c/cc/ccc    3       hello           6       False 

Filtering on the row where the column match is True and dropping the match column, we get the expected result :

>>> dfFull[dfFull['match']].drop(['match'], axis=1)
    value_df1   col_a   value_df2   col_b
3   b/bb/bbb    2       bb/bbb      4       
7   c/cc/ccc    3       ccc         5       

This solution is inspired by this post.

Upvotes: 1

Jason Cook
Jason Cook

Reputation: 1511

Since you have a known delimiter, you can split on that delimiter, do some merging, and then add back in the original data.

# sample data
df1 = pd.DataFrame({'ColumnA': [1,2,3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
df2 = pd.DataFrame({'ColumnB': [4,5,6]}, index=['bb/bbb', 'ccc', 'hello'])

# set original index as column
# make a copy of each dataframe to preserve original data
# reset index of copy to keep track of original row number
df1 = df1.reset_index()
copy_df1 = df1
copy_df1.index.name = 'row_df1'
copy_df1 = df1.reset_index()

df2 = df2.reset_index()
copy_df2 = df2
copy_df2.index.name = 'row_df2'
copy_df2 = copy_df2.reset_index()

# split on known delimiter and explode into rows for each substring
copy_df1['index'] = copy_df1['index'].str.split('/')
copy_df1 = copy_df1.explode('index')

copy_df2['index'] = copy_df2['index'].str.split('/')
copy_df2 = copy_df2.explode('index')

# merge based on substrings, drop duplicates in case of multiple substring matches
mrg = copy_df1[['row_df1','index']].merge(copy_df2[['row_df2','index']]).drop(columns='index')
mrg = mrg.drop_duplicates()

# merge back in original details
mrg = mrg.merge(df1, left_on='row_df1', right_index=True)
mrg = mrg.merge(df2, left_on='row_df2', right_index=True, suffixes=('_df1','_df2'))

The final output would be:

   row_df1  row_df2 index_df1  ColumnA index_df2  ColumnB
0        1        0  b/bb/bbb        2    bb/bbb        4
2        2        1  c/cc/ccc        3       ccc        5

Upvotes: 0

Related Questions