Reputation: 325
Here's the problem: I know how to merge two DataFrame
s on indexes or on columns, but I am not able to merge them on a both indexes and columns.
I have two DataFrame
s and I want to merge them on indexes (which are dates) and on column id
. I created some sample data to better explain my problem.
from datetime import date
import numpy as np
import pandas as pd
np.random.seed(200)
dates = [date(2020, 1, 31), date(2020, 2, 28)]
a = {"id": ["A", "B"] * len(dates), "w": [.5, .5] * len(dates)}
b = {"id": ["B", "A"] * len(dates), "x": np.random.random(2 * len(dates))}
a = pd.DataFrame(a, index=dates * len(dates))
b = pd.DataFrame(b, index=dates * len(dates))
Desired output:
id w x
2020-01-31 A 0.5 0.226547
2020-02-28 B 0.5 0.947632
2020-01-31 A 0.5 0.428309
2020-02-28 B 0.5 0.594420
Please note that I am searching for a general solution, where a
and b
do not necessarily contains the same indexes or elements in id
.
Upvotes: 0
Views: 91
Reputation: 29635
IIUC, you can use set_index
to append the columns, use join
and then reset_index
like
print(a.set_index('id', append=True)\
.join(b.set_index('id', append=True), how='outer')\
.reset_index('id'))
id w x
2020-01-31 A 0.5 0.947632
2020-02-28 B 0.5 0.226547
2020-01-31 B 0.5 0.594420
2020-02-28 A 0.5 0.428309
or the opposite direction with merge
:
print(a.reset_index()\
.merge(b.reset_index(), on=['index', 'id'], how='outer')\
.set_index('index'))
id w x
index
2020-01-31 A 0.5 0.947632
2020-02-28 B 0.5 0.226547
2020-01-31 B 0.5 0.594420
2020-02-28 A 0.5 0.428309
Just to be sure that is what you want to do, let's assume a and b are like this with another id:
a = pd.DataFrame({"id": ["A", "B", 'B','A'] , "w": np.random.random(4)},
index=[date(2020, 1, 31), date(2020, 2, 28)]*2)
# id w
#2020-01-31 A 0.764141
#2020-02-28 B 0.002861
#2020-01-31 B 0.357424
#2020-02-28 A 0.909695
b = pd.DataFrame({"id": ["A", "B", 'C','A'], "x": np.random.random(4)},
index=[date(2020, 1, 31), date(2020, 2, 28)]*2)
# id x
#2020-01-31 A 0.456081
#2020-02-28 B 0.981803
#2020-01-31 C 0.867357
#2020-02-28 A 0.986028
Then the result of the method with join
is:
id w x
2020-01-31 A 0.764141 0.456081
2020-01-31 B 0.357424 NaN
2020-01-31 C NaN 0.867357
2020-02-28 A 0.909695 0.986028
2020-02-28 B 0.002861 0.981803
Upvotes: 2
Reputation: 1293
You can simply add a new column using b['w'] = a['w']
. This is note really a merge, but a copy from a into b.
The full code is :
from datetime import date
import numpy as np
import pandas as pd
np.random.seed(200)
ids = ["A", "B"]
dates = [date(2020, 1, 31), date(2020, 2, 28)]
a = {"id": ids * len(dates), "w": [.5, .5] * len(dates)}
b = {"id": ids * len(dates), "x": np.random.random(len(ids) * len(dates))}
a = pd.DataFrame(a, index=dates * len(dates))
b = pd.DataFrame(b, index=dates * len(dates))
b['w'] = a['w']
print(b)
EDIT: Other way to obtain the result you want (well, I'm not so sure because of the duplicate 'id' column). Please let me known the structure of the id of the two dataframe :
import pandas as pd
a = pandas.DataFrame([
['A', 0.5],
['B', 1],
['C', 1.5],
['D', 2.]],
columns=['id', 'w'],
index=['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'])
print(a)
b = pandas.DataFrame([
['A', 0.5],
['B', 1],
['C', 1.5],
['D', 2.]],
columns=['id', 'x'],
index=['2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05'])
print(b)
c = pandas.concat([a, b], axis=1)
print(c)
output:
id w
2020-01-01 A 0.5
2020-01-02 B 1.0
2020-01-03 C 1.5
2020-01-04 D 2.0
id x
2020-01-02 A 0.5
2020-01-03 B 1.0
2020-01-04 C 1.5
2020-01-05 D 2.0
id w id x
2020-01-01 A 0.5 NaN NaN
2020-01-02 B 1.0 A 0.5
2020-01-03 C 1.5 B 1.0
2020-01-04 D 2.0 C 1.5
2020-01-05 NaN NaN D 2.0
Upvotes: 1
Reputation: 24
It doesn't seem to be a merger issue but more feeding to me. Adding seems to work
a['x'] = b['x']
Upvotes: 0
Reputation: 153460
Use a helper column based on cumcount, as give indexes a name to make it eaiser to merge on indexes:
a['helper'] = a.groupby([a.index, 'id']).cumcount()
b['helper'] = b.groupby([b.index, 'id']).cumcount()
a = a.rename_axis('date')
b = b.rename_axis('date')
a.merge(b, on=['date','id','helper']).drop('helper', axis=1)
Output:
id w x
date
2020-01-31 A 0.5 0.947632
2020-02-28 B 0.5 0.226547
2020-01-31 A 0.5 0.594420
2020-02-28 B 0.5 0.428309
Upvotes: 1