Reputation: 287
I want to merge two Pandas DataFrames. If the item
code (e.g. A,B,C,D) are the same,
their attributes a
, b
must be the same, but b
is a numpy array or a list that is
unhashable.
Foo:
item a b
A 1 [2,0]
B 1 [3,0]
C 0 [4,0]
Bar:
item a b
A 1 [2,0]
D 0 [6,1]
This is what I want
code a b Foo Bar
A 1 [2,0] 1 1
B 1 [3,0] 1 0
C 0 [4,0] 1 0
D 0 [6,1] 0 1
Upvotes: 6
Views: 19947
Reputation: 294318
cols = ['a', 'b', 'item']
pd.concat([Foo, Bar], keys=['Foo', 'Bar']) \
.assign(c=1).pipe(lambda d: d.assign(b=d.b.apply(tuple))) \
.set_index(cols, append=True) \
.c.unstack(0, fill_value=0).reset_index(cols) \
.pipe(lambda d: d.assign(b=d.b.apply(list)))
a b item Bar Foo
0 1 [2, 0] A 1 1
1 0 [6, 1] D 1 0
1 1 [3, 0] B 0 1
2 0 [4, 0] C 0 1
Upvotes: 2
Reputation: 323276
Or you can try this
foo.b = foo.b.apply(tuple)
bar.b = bar.b.apply(tuple)
df=pd.concat([foo,bar],axis=0).drop_duplicates()
df['foo']=df.isin(foo).a.astype(int)
df['bar']=df.isin(bar).a.astype(int)
df.b=df.b.apply(list)
df
Out[60]:
a b item foo bar
0 1 [2, 0] A 1 1
1 1 [3, 0] B 1 0
2 0 [4, 0] C 1 0
1 0 [6, 1] D 0 1
Upvotes: 3
Reputation: 879691
Here is a way to merge without converting the unhashables to tuples.
Since the item
code has a 1-to-1 correspondence with the values in the a
and
b
columns, it suffices to merge on item
alone. Since the values in the item
column are hashable,
there is no problem merging:
import pandas as pd
foo = pd.DataFrame({'item': list('ABC'), 'a':[1,1,0], 'b':[[2,0], [3,0], [4,0]]})
bar = pd.DataFrame({'item': list('AD'), 'a':[1,0], 'b':[[2,0], [6,1]]})
result = pd.merge(foo.assign(Foo=1), bar.assign(Bar=1), on='item', how='outer',
suffixes=['', '_y'])
for col in ['a','b']:
result[col].update(result[col+'_y'])
for col in ['Foo', 'Bar']:
result[col] = result[col].fillna(0)
result = result.drop(['a_y', 'b_y'], axis=1)
print(result)
yields
a b item Foo Bar
0 1.0 [2, 0] A 1.0 1.0
1 1.0 [3, 0] B 1.0 0.0
2 0.0 [4, 0] C 1.0 0.0
3 0.0 [6, 1] D 0.0 1.0
There is a bit of touch-up work needed after the merge, however. Since we only
merge on item
, result
gets two columns of a
and b
-- the ones from bar
are called a_y
, and b_y
. The update
method is used to fill in NaN values
from a
with corresponding values from a_y
, and then the same is also done
for b
.
The clever idea to use foo.assign(Foo=1), bar.assign(Bar=1)
to obtain the Foo
and Bar
columns was taken from cᴏʟᴅsᴘᴇᴇᴅ's solution.
Upvotes: 6
Reputation: 402553
You could use df.merge
and df.fillna
:
out = foo.assign(Foo=1).merge(bar.assign(Bar=1), 'outer').fillna(0)
print(out)
item a b Foo Bar
0 A 1 (2, 0) 1.0 1.0
1 B 1 (3, 0) 1.0 0.0
2 C 0 (4, 0) 1.0 0.0
3 D 0 (6, 1) 0.0 1.0
If b
is a list type, you could convert it to a tuple first and then merge.
foo.b = foo.b.apply(tuple)
bar.b = bar.b.apply(tuple)
out = foo.assign(Foo=1).merge(bar.assign(Bar=1), 'outer').fillna(0)
out.b = out.b.apply(list)
print(out)
item a b Foo Bar
0 A 1 [2, 0] 1.0 1.0
1 B 1 [3, 0] 1.0 0.0
2 C 0 [4, 0] 1.0 0.0
3 D 0 [6, 1] 0.0 1.0
Upvotes: 9