rbc-2019
rbc-2019

Reputation: 165

Merge two dataframes on a column of lists

I have a two dataframes.

df = pd.DataFrame([[1,2,3,[4,5]],[6,7,8,[9,10]]], columns=['a','b','c','d'])

df2 = pd.DataFrame([[4,'abc'],[5,'ef'], [10,'g'], [12,'hijk']], columns=['a_2','b_2'])

In [151]: df
Out[151]: 
   a  b  c        d
0  1  2  3   [4, 5]
1  6  7  8  [9, 10]

In [152]: df2
Out[152]: 
   a_2   b_2
0    4   abc
1    5    ef
2   10     g
3   12  hijk

I want to merge the two based on column 'd' of df and get the following output-

df3 = pd.DataFrame([[1,2,3,[4,5],['abc','ef']],[6,7,8,[9,10],['g']]], columns=['a','b','c','d','b_2'])


In [153]: df3
Out[153]: 
   a  b  c        d        b_2
0  1  2  3   [4, 5]  [abc, ef]
1  6  7  8  [9, 10]        [g]

I did try with 'merge' but I am not getting the required results.

Upvotes: 2

Views: 123

Answers (3)

rhug123
rhug123

Reputation: 8768

Try this:

(pd.merge(
    df.explode('d').reset_index(),
    df2,left_on='d',
    right_on='a_2',
    how = 'left'
    )
.groupby(['index',*list('abc')])[['d','b_2']].agg(lambda x: x.dropna().tolist())
.reset_index()
.drop('index',axis=1))

Output:

   a  b  c        d        b_2
0  1  2  3   [4, 5]  [abc, ef]
1  6  7  8  [9, 10]        [g]

Upvotes: 0

G.G
G.G

Reputation: 765

col1=df1.apply(lambda ss:df2.query("a_2 in @ss.d").b_2.tolist(),1)
df1.assign(b_2=col1)

:

┌───────┬───────┬───────┬───────┬─────────┬───────────┐
│ index │   a   │   b   │   c   │    d    │    b_2    │
│ int64 │ int64 │ int64 │ int64 │ int32[] │ varchar[] │
├───────┼───────┼───────┼───────┼─────────┼───────────┤
│     0 │     1 │     2 │     3 │ [4, 5]  │ [abc, ef] │
│     1 │     6 │     7 │     8 │ [9, 10] │ [g]       │
└───────┴───────┴───────┴───────┴─────────┴───────────┘

Upvotes: 0

cs95
cs95

Reputation: 402493

This isn't exactly a merge problem, but I'd do this using a list comprehension that calls Series.get:

s = df2.set_index('a_2')['b_2']  # mapping to use
[[s.get(y) for y in x if y in s] for x in df['d']]
# [['abc', 'ef'], ['g']]

# df['b_2'] = [[s.get(y) for y in x if y in s] for x in df['d']]
df3 = df.assign(b_2=[[s.get(y) for y in x if y in s] for x in df['d']])
df3

   a  b  c        d        b_2
0  1  2  3   [4, 5]  [abc, ef]
1  6  7  8  [9, 10]        [g]

Evidence suggests that list comprehensions should be reasonably fast given the non-vectorizable nature of your problem.

Upvotes: 2

Related Questions