iceAtNight7
iceAtNight7

Reputation: 304

How to create a list based on column matches within 2 data frames? Python

In my actual dataset data is 35 million rows by 20 columns and data2 is 4000 rows by 10 columns. Although this code would work it would take so long that my system would time out. So I am looking for an alternate solution to run faster.

import pandas as pd

data = pd.DataFrame({'variable1':[1,2,3,4,0,6,7], 'variable2':[1,2,3,4,5,6,7], 'variable3':[1,200,3,4,50,6,7], 'variable4':[1,2,3,4,5,6,7]})

data2 = pd.DataFrame({'variable1':[2,0], 'variable2':[2,5], 'variable3':[200,50], 'variable4':[17,20]})


target = []

for i in range(len(data)):
    for j in range(len(data2)):
        if (data['variable1'].iloc[i] == data2['variable1'].iloc[j]) and (data['variable2'].iloc[i] == data2['variable2'].iloc[j]):
            target.append("Yes")
        else: target.append("No")
Proper output would be:
[[1,1,1,1,"No"],
[2,2,200,2,"Yes"],
[3,3,3,3,"No"],
[4,4,4,4,"No"],
[0,5,50,5,"Yes"],
[6,6,6,6,"No"],
[7,7,7,7,"No"]]

Upvotes: 2

Views: 324

Answers (3)

Naveed
Naveed

Reputation: 11650

df2=data.merge(data2, 
           left_on=['variable1', 'variable2'],
           right_on=['variable1','variable2'],
           suffixes=(None,'_y'),a
           how='left'
          ).drop(columns='variable4_y').rename(columns={'variable3_y':'match'})
df2['match'] = np.where(df2['match'].isnull(), 'No', 'Yes')
df2.values.tolist()


[[1, 1, 1, 1, 'No'],
 [2, 2, 200, 2, 'Yes'],
 [3, 3, 3, 3, 'No'],
 [4, 4, 4, 4, 'No'],
 [0, 5, 50, 5, 'Yes'],
 [6, 6, 6, 6, 'No'],
 [7, 7, 7, 7, 'No']]

Upvotes: 0

Shubham Sharma
Shubham Sharma

Reputation: 71689

MultiIndex.isin

c = ['variable1', 'variable2']
data['match'] = data.set_index(c).index.isin(data2.set_index(c).index)

   variable1  variable2  variable3  variable4  match
0          1          1          1          1  False
1          2          2        200          2   True
2          3          3          3          3  False
3          4          4          4          4  False
4          0          5         50          5   True
5          6          6          6          6  False
6          7          7          7          7  False

Upvotes: 1

mozway
mozway

Reputation: 260410

You want to test too many combinations (35M * 4K is 140 Billion, which is too big for pandas).

You can try with numpy, although it is still a lot of combinations and would require a lot of memory (hundreds of GB):

a1 = data['variable1'].to_numpy()
a2 = data['variable2'].to_numpy()

b1 = data2['variable1'].to_numpy()
b2 = data2['variable2'].to_numpy()

out = ((a1==b1[:,None])&(a2==b2[:,None])).ravel()

output:

array([False,  True, False, False, False, False, False, False, False,
       False, False,  True, False, False])

Upvotes: 1

Related Questions