RJH2
RJH2

Reputation: 439

Merge two data frames while using boolean indices (to filter)

I have what may be a simple question related to syntax, but can't figure out.

I have two data frames, df1 and df2, that I'd like to a) merge on specific columns, while b) simultaneously checking another column in each data frame for a boolean relationship (>, <, or ==).

The crucial part is that I need to do both a and b simultaneously because the data frames are very large. It does not work to simply merge the two data frames in one step, then remove the rows that don't pass the boolean logic in a second step. This is because the merged data frame would be very, very large and cause me two run out of memory.

So, I have:

df1:
    Col_1   Col_2   Test_Value
0   A       B       1
1   B       A       3
2   A       B       2
3   B       A       5
4   A       B       2
5   B       A       1

and

df2:
    Col_1   Col_2   Test_Value
0   A       B       1
1   B       A       3
2   A       B       2
3   B       A       5
4   A       B       2
5   B       A       1

(for simplicity, the two data frames are identical)

And I'd like the to merge them, like so:

df3 = pd.merge(df1, df2, left_on=['Col_1'], right_on=['Col_2'])

While simultaneously filtering for any row where df1['Test Value'] is less than df2['Test Value'], like so:

df3.loc[df3['Test_Value_x'] < df3['Test_Value_y']]

The result would be:

    Col_1_x Col_2_x Test_Value_x    Col_1_y Col_2_y Test_Value_y
0   A       B       1               B       A       3
1   A       B       1               B       A       5
3   A       B       2               B       A       3
4   A       B       2               B       A       5
6   A       B       2               B       A       3
7   A       B       2               B       A       5
16  B       A       1               A       B       2
17  B       A       1               A       B       2

Again, I can do this in two steps, with the code above, but it creates a memory problem for me because the intermediate data frame would be so large.

So is there syntax that could combine this,

df3 = pd.merge(df1, df2, left_on=['Col_1'], right_on=['Col_2'])

with this,

df3.loc[df3['Test_Value_x'] < df3['Test_Value_y']]

Upvotes: 0

Views: 339

Answers (1)

ctrl-alt-delete
ctrl-alt-delete

Reputation: 3852

Try this:

import pandas as pd

df1_col1 = pd.Series(['A', 'B', 'A', 'B', 'A', 'B'], index=[0, 1, 2, 3, 4, 5 ])
df1_col2 = pd.Series(['B', 'A', 'B', 'A', 'B', 'A'], index=[0, 1, 2, 3, 4, 5])
df1_col3 = pd.Series([1, 3, 2, 5, 2, 1], index=[0, 1, 2, 3, 4, 5])
df1 = pd.concat([df1_col1, df1_col2, df1_col3], axis=1)

df1 = df1.rename(columns={0: 'Col_1', 1: 'Col_2', 2: 'Test_Value'})
df2 = df1.copy(deep=True)

To Answer your question as above:

df3 = pd.merge(df1, df2, left_on=['Col_1'], right_on=['Col_2'])[pd.merge(df1, df2, left_on=['Col_1'], right_on=['Col_2'])['Test_Value_x']
      <pd.merge(df1, df2, left_on=['Col_1'], right_on=['Col_2'])['Test_Value_y']]

Upvotes: 1

Related Questions