Reputation: 304

How to create a list based on column matches within 2 data frames? Python

In my actual dataset data is 35 million rows by 20 columns and data2 is 4000 rows by 10 columns. Although this code would work it would take so long that my system would time out. So I am looking for an alternate solution to run faster.

import pandas as pd

data = pd.DataFrame({'variable1':[1,2,3,4,0,6,7], 'variable2':[1,2,3,4,5,6,7], 'variable3':[1,200,3,4,50,6,7], 'variable4':[1,2,3,4,5,6,7]})

data2 = pd.DataFrame({'variable1':[2,0], 'variable2':[2,5], 'variable3':[200,50], 'variable4':[17,20]})


target = []

for i in range(len(data)):
    for j in range(len(data2)):
        if (data['variable1'].iloc[i] == data2['variable1'].iloc[j]) and (data['variable2'].iloc[i] == data2['variable2'].iloc[j]):
            target.append("Yes")
        else: target.append("No")

Proper output would be:
[[1,1,1,1,"No"],
[2,2,200,2,"Yes"],
[3,3,3,3,"No"],
[4,4,4,4,"No"],
[0,5,50,5,"Yes"],
[6,6,6,6,"No"],
[7,7,7,7,"No"]]

Upvotes: 2

Answers (3)

Naveed

Reputation: 11650

df2=data.merge(data2, 
           left_on=['variable1', 'variable2'],
           right_on=['variable1','variable2'],
           suffixes=(None,'_y'),a
           how='left'
          ).drop(columns='variable4_y').rename(columns={'variable3_y':'match'})
df2['match'] = np.where(df2['match'].isnull(), 'No', 'Yes')
df2.values.tolist()

[[1, 1, 1, 1, 'No'],
 [2, 2, 200, 2, 'Yes'],
 [3, 3, 3, 3, 'No'],
 [4, 4, 4, 4, 'No'],
 [0, 5, 50, 5, 'Yes'],
 [6, 6, 6, 6, 'No'],
 [7, 7, 7, 7, 'No']]

Upvotes: 0

Shubham Sharma

Reputation: 71689

`MultiIndex.isin`

c = ['variable1', 'variable2']
data['match'] = data.set_index(c).index.isin(data2.set_index(c).index)

   variable1  variable2  variable3  variable4  match
0          1          1          1          1  False
1          2          2        200          2   True
2          3          3          3          3  False
3          4          4          4          4  False
4          0          5         50          5   True
5          6          6          6          6  False
6          7          7          7          7  False

Upvotes: 1

mozway

Reputation: 260410

You want to test too many combinations (35M * 4K is 140 Billion, which is too big for pandas).

You can try with numpy, although it is still a lot of combinations and would require a lot of memory (hundreds of GB):

a1 = data['variable1'].to_numpy()
a2 = data['variable2'].to_numpy()

b1 = data2['variable1'].to_numpy()
b2 = data2['variable2'].to_numpy()

out = ((a1==b1[:,None])&(a2==b2[:,None])).ravel()

output:

array([False,  True, False, False, False, False, False, False, False,
       False, False,  True, False, False])

Upvotes: 1

How to create a list based on column matches within 2 data frames? Python

Answers (3)

MultiIndex.isin

Related Questions

`MultiIndex.isin`