Reputation: 63
I have 2 dataframes and I want to check if the Start, End ranges in DF1 are within the Start, End ranges in DF2 and for the ones that are true I want to print the ID and the region. I want to compare each row of DF1 to each row of DF2.
These are my dataframes:
DF1 = pd.DataFrame ({'Start':[500, 850, 1000],
'End':[700, 950, 1200],
'Region':["A", "B", "C"]})
DF2 = pd.DataFrame ({'Start':[200, 800, 1100],
'End':[750, 950, 1250],
'ID':[1, 2, 3]})
DF1
Start | End | Region |
---|---|---|
500 | 700 | A |
850 | 950 | B |
1000 | 1200 | C |
1100 | 1500 | D |
DF2
Start | End | ID |
---|---|---|
200 | 750 | 1 |
800 | 950 | 2 |
1100 | 1250 | 3 |
I assume that I have to write a for loop to iterate through all the rows. However, I am a beginner and I am having a hard time setting it up correctly.
This is the code I have tried so far.
for Start, End in DF1:
if Start>=DF2["Start"] and End<=DF2["End"]:
print (DF1["Region"], DF2["ID"])
However, I am getting this error: ValueError: too many values to unpack (expected 2)
Any advice on how to solve this would be greatly appreciated.
Upvotes: 4
Views: 549
Reputation: 28729
One option is with conditional_join from pyjanitor, and is efficient for range joins as well, and is better than a naive cross join:
# pip install pyjanitor
# you can also install the dev version for the latest
# including the ability to use numba for faster performance
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(DF1
.conditional_join(
DF2,
('Start', 'Start', '>='),
('End', 'End', '<='))
)
left right
Start End Region Start End ID
0 500 700 A 200 750 1
1 850 950 B 800 950 2
With the dev
version, you can select the columns as well:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(DF1
.conditional_join(
DF2,
('Start', 'Start', '>='),
('End', 'End', '<='),
df_columns='Region',
right_columns='ID')
)
Region ID
0 A 1
1 B 2
With the dev
version, if you have numba
installed, you can enable it for more performance.
Upvotes: 0
Reputation: 23227
You can cross join the 2 dataframes by .merge()
and then use .query()
to filter the rows with the required condition, as follows:
DF_out = DF1.merge(DF2, how='cross').query('(Start_x >= Start_y) & (End_x <= End_y)')
If your Pandas version is older than 1.2.0 (December 2020 version) and does not support how='cross'
, you can use:
DF_out = DF1.assign(key=1).merge(DF2.assign(key=1), on='key').drop('key', axis=1).query('(Start_x >= Start_y) & (End_x <= End_y)')
Result:
Start_x
, End_x
are the original Start
, End
columns in DF1
Start_y
, End_y
are the original Start
, End
columns in DF2
print(DF_out)
Start_x End_x Region Start_y End_y ID
0 500 700 A 200 750 1
4 850 950 B 800 950 2
You can then easily print the Region
and ID
, e.g.
print(DF_out['Region'])
0 A
4 B
Name: Region, dtype: object
print(DF_out['ID'])
0 1
4 2
Name: ID, dtype: int64
If your checking criteria is strictly <
or >
rather than <=
or >=
, you can modify the comparison operator symbols in the .query()
, as follows:
DF_out = DF1.merge(DF2, how='cross').query('(Start_x > Start_y) & (End_x < End_y)')
Result:
print(DF_out)
Start_x End_x Region Start_y End_y ID
0 500 700 A 200 750 1
Upvotes: 1