davide
davide

Reputation: 325

how to merge two dataframes on both indexes and columns

Here's the problem: I know how to merge two DataFrames on indexes or on columns, but I am not able to merge them on a both indexes and columns.

I have two DataFrames and I want to merge them on indexes (which are dates) and on column id. I created some sample data to better explain my problem.

from datetime import date
import numpy as np
import pandas as pd

np.random.seed(200)
dates = [date(2020, 1, 31), date(2020, 2, 28)]
a = {"id": ["A", "B"] * len(dates), "w": [.5, .5] * len(dates)}
b = {"id": ["B", "A"] * len(dates), "x": np.random.random(2 * len(dates))}

a = pd.DataFrame(a, index=dates * len(dates))
b = pd.DataFrame(b, index=dates * len(dates))

Desired output:

           id    w         x
2020-01-31  A  0.5  0.226547
2020-02-28  B  0.5  0.947632
2020-01-31  A  0.5  0.428309
2020-02-28  B  0.5  0.594420

Please note that I am searching for a general solution, where a and b do not necessarily contains the same indexes or elements in id.

Upvotes: 0

Views: 91

Answers (4)

Ben.T
Ben.T

Reputation: 29635

IIUC, you can use set_index to append the columns, use join and then reset_index like

print(a.set_index('id', append=True)\
       .join(b.set_index('id', append=True), how='outer')\
       .reset_index('id'))
           id    w         x
2020-01-31  A  0.5  0.947632
2020-02-28  B  0.5  0.226547
2020-01-31  B  0.5  0.594420
2020-02-28  A  0.5  0.428309

or the opposite direction with merge:

print(a.reset_index()\
       .merge(b.reset_index(), on=['index', 'id'], how='outer')\
       .set_index('index'))
           id    w         x
index                       
2020-01-31  A  0.5  0.947632
2020-02-28  B  0.5  0.226547
2020-01-31  B  0.5  0.594420
2020-02-28  A  0.5  0.428309

Just to be sure that is what you want to do, let's assume a and b are like this with another id:

a = pd.DataFrame({"id": ["A", "B", 'B','A'] , "w": np.random.random(4)}, 
                 index=[date(2020, 1, 31), date(2020, 2, 28)]*2)
#           id         w
#2020-01-31  A  0.764141
#2020-02-28  B  0.002861
#2020-01-31  B  0.357424
#2020-02-28  A  0.909695

b = pd.DataFrame({"id": ["A", "B", 'C','A'], "x": np.random.random(4)}, 
                 index=[date(2020, 1, 31), date(2020, 2, 28)]*2)
#           id         x
#2020-01-31  A  0.456081
#2020-02-28  B  0.981803
#2020-01-31  C  0.867357
#2020-02-28  A  0.986028

Then the result of the method with join is:

           id         w         x
2020-01-31  A  0.764141  0.456081
2020-01-31  B  0.357424       NaN
2020-01-31  C       NaN  0.867357
2020-02-28  A  0.909695  0.986028
2020-02-28  B  0.002861  0.981803

Upvotes: 2

pyOliv
pyOliv

Reputation: 1293

You can simply add a new column using b['w'] = a['w']. This is note really a merge, but a copy from a into b.

The full code is :

from datetime import date
import numpy as np
import pandas as pd

np.random.seed(200)
ids = ["A", "B"]
dates = [date(2020, 1, 31), date(2020, 2, 28)]
a = {"id": ids * len(dates), "w": [.5, .5] * len(dates)}
b = {"id": ids * len(dates), "x": np.random.random(len(ids) * len(dates))}

a = pd.DataFrame(a, index=dates * len(dates))
b = pd.DataFrame(b, index=dates * len(dates))

b['w'] = a['w']
print(b)

EDIT: Other way to obtain the result you want (well, I'm not so sure because of the duplicate 'id' column). Please let me known the structure of the id of the two dataframe :

import pandas as pd

a = pandas.DataFrame([
    ['A', 0.5],
    ['B', 1],
    ['C', 1.5],
    ['D', 2.]],
    columns=['id', 'w'], 
    index=['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'])
print(a)

b = pandas.DataFrame([
    ['A', 0.5],
    ['B', 1],
    ['C', 1.5],
    ['D', 2.]],
    columns=['id', 'x'], 
    index=['2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05'])
print(b)

c = pandas.concat([a, b], axis=1)
print(c)

output:

           id    w
2020-01-01  A  0.5
2020-01-02  B  1.0
2020-01-03  C  1.5
2020-01-04  D  2.0
           id    x
2020-01-02  A  0.5
2020-01-03  B  1.0
2020-01-04  C  1.5
2020-01-05  D  2.0
             id    w   id    x
2020-01-01    A  0.5  NaN  NaN
2020-01-02    B  1.0    A  0.5
2020-01-03    C  1.5    B  1.0
2020-01-04    D  2.0    C  1.5
2020-01-05  NaN  NaN    D  2.0

Upvotes: 1

Thabris
Thabris

Reputation: 24

It doesn't seem to be a merger issue but more feeding to me. Adding seems to work

a['x'] = b['x']

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153460

Use a helper column based on cumcount, as give indexes a name to make it eaiser to merge on indexes:

a['helper'] = a.groupby([a.index, 'id']).cumcount()
b['helper'] = b.groupby([b.index, 'id']).cumcount()
a = a.rename_axis('date')
b = b.rename_axis('date')

a.merge(b, on=['date','id','helper']).drop('helper', axis=1)

Output:

           id    w         x
date                        
2020-01-31  A  0.5  0.947632
2020-02-28  B  0.5  0.226547
2020-01-31  A  0.5  0.594420
2020-02-28  B  0.5  0.428309

Upvotes: 1

Related Questions