pythonbeginner
pythonbeginner

Reputation: 63

How can I check if the range of one dataframe is within the range of another dataframe

I have 2 dataframes and I want to check if the Start, End ranges in DF1 are within the Start, End ranges in DF2 and for the ones that are true I want to print the ID and the region. I want to compare each row of DF1 to each row of DF2.

These are my dataframes:

DF1 = pd.DataFrame ({'Start':[500, 850, 1000],
                    'End':[700, 950, 1200],
                    'Region':["A", "B", "C"]})

DF2 = pd.DataFrame ({'Start':[200, 800, 1100],
                    'End':[750, 950, 1250],
                    'ID':[1, 2, 3]})

DF1

Start End Region
500 700 A
850 950 B
1000 1200 C
1100 1500 D

DF2

Start End ID
200 750 1
800 950 2
1100 1250 3

I assume that I have to write a for loop to iterate through all the rows. However, I am a beginner and I am having a hard time setting it up correctly.

This is the code I have tried so far.

for Start, End in DF1:
    if Start>=DF2["Start"] and End<=DF2["End"]:
      print (DF1["Region"], DF2["ID"])

However, I am getting this error: ValueError: too many values to unpack (expected 2)

Any advice on how to solve this would be greatly appreciated.

Upvotes: 4

Views: 549

Answers (2)

sammywemmy
sammywemmy

Reputation: 28729

One option is with conditional_join from pyjanitor, and is efficient for range joins as well, and is better than a naive cross join:

# pip install pyjanitor
# you can also install the dev version for the latest
# including the ability to use numba for faster performance
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd

(DF1
.conditional_join(
    DF2, 
    ('Start', 'Start', '>='), 
    ('End', 'End', '<='))
)
   left             right
  Start  End Region Start  End ID
0   500  700      A   200  750  1
1   850  950      B   800  950  2

With the dev version, you can select the columns as well:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd

(DF1
.conditional_join(
    DF2, 
    ('Start', 'Start', '>='), 
    ('End', 'End', '<='),
    df_columns='Region',
    right_columns='ID')
)

  Region  ID
0      A   1
1      B   2

With the dev version, if you have numba installed, you can enable it for more performance.

Upvotes: 0

SeaBean
SeaBean

Reputation: 23227

You can cross join the 2 dataframes by .merge() and then use .query() to filter the rows with the required condition, as follows:

DF_out = DF1.merge(DF2, how='cross').query('(Start_x >= Start_y) & (End_x <= End_y)')

If your Pandas version is older than 1.2.0 (December 2020 version) and does not support how='cross', you can use:

DF_out = DF1.assign(key=1).merge(DF2.assign(key=1), on='key').drop('key', axis=1).query('(Start_x >= Start_y) & (End_x <= End_y)')

Result:

Start_x, End_x are the original Start, End columns in DF1

Start_y, End_y are the original Start, End columns in DF2

print(DF_out)

   Start_x  End_x Region  Start_y  End_y  ID
0      500    700      A      200    750   1
4      850    950      B      800    950   2

You can then easily print the Region and ID, e.g.

print(DF_out['Region'])

0    A
4    B
Name: Region, dtype: object



print(DF_out['ID'])

0    1
4    2
Name: ID, dtype: int64

If your checking criteria is strictly < or > rather than <= or >=, you can modify the comparison operator symbols in the .query(), as follows:

DF_out = DF1.merge(DF2, how='cross').query('(Start_x > Start_y) & (End_x < End_y)')

Result:

print(DF_out)

   Start_x  End_x Region  Start_y  End_y  ID
0      500    700      A      200    750   1

Upvotes: 1

Related Questions