Reputation: 145
I'm working with a list of values stored as a dataframe, and I'd like to compare each value in this list against another dataframe (like the one below):
Name    Start  End
Blue       10   28
Red        23   25
Green      89  107
Purple    168  216
Yellow     21   40
Now let's suppose that the list of values is something like this:
Name  Value
W        37
X       176
Y        43
Z        96
For each element in the "Value" column I'd like to check whether that value is contained inside any "Start" - "End" range of the first dataframe and, if so, add it to a list (i.e. contained = []). In the example, W (37) is contained in Yellow, X (176) in Purple, and Z (96) in Green, while Y does not match any range, so it will be excluded (or added to another list, not_contained = []).
How can I do that? Thank you all.
Upvotes: 2
Views: 2014
Reputation: 28729
You can use conditional_join from pyjanitor to simulate a range join, like in SQL (note that this is in the dev version):
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
result = (df2.select_columns('Value')
             .conditional_join(df1.select_columns('Start', 'End'),
                               ('Value', 'Start', '>='),
                               ('Value', 'End', '<='),
                               how='left')
         )
Value Start End
0 37 21.0 40.0
1 176 168.0 216.0
2 43 NaN NaN
3 96 89.0 107.0
Filter into contained and not_contained based on null values in either Start or End:
contained = result.Value[result.Start.notna()].to_list()
not_contained = result.Value[~result.Start.notna()].to_list()
print(contained)
[37, 176, 96]
print(not_contained)
[43]
This is overkill, though, and not really efficient in my opinion; you do not need a dataframe here, just some lists. A much simpler way is to use Pandas' intervals coupled with a for-loop; for large dataframes, I expect this to be more efficient:
# create intervalindex
intervals = pd.IntervalIndex.from_arrays(df1.Start, df1.End, closed = 'both')
# We could have used pandas get_indexer,
# however, the intervals are overlapping,
# and get_indexer works only with non-overlapping/unique indexes
contained = []
not_contained = []
for _, value in df2.Value.items():
if intervals.contains(value).any():
contained.append(value)
else:
not_contained.append(value)
print(contained)
[37, 176, 96]
print(not_contained)
[43]
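The per-value loop above can also be vectorized with NumPy broadcasting, which avoids any Python-level loop; this is a sketch that re-creates the example frames from the question (the df1/df2 contents here are assumptions copied from the post):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['Blue', 'Red', 'Green', 'Purple', 'Yellow'],
                    'Start': [10, 23, 89, 168, 21],
                    'End': [28, 25, 107, 216, 40]})
df2 = pd.DataFrame({'Name': ['W', 'X', 'Y', 'Z'],
                    'Value': [37, 176, 43, 96]})

values = df2['Value'].to_numpy()
# (n_values, n_ranges) boolean matrix: True where a value falls in [Start, End]
in_range = (values[:, None] >= df1['Start'].to_numpy()) & \
           (values[:, None] <= df1['End'].to_numpy())
mask = in_range.any(axis=1)  # True if the value matched at least one range

contained = df2['Value'][mask].tolist()
not_contained = df2['Value'][~mask].tolist()
print(contained)      # [37, 176, 96]
print(not_contained)  # [43]
```

The intermediate matrix costs O(len(df1) * len(df2)) memory, so this trades space for speed compared to the interval loop.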
Upvotes: 1
Reputation: 7223
You can turn your dataframe into a dictionary, then do a lookup for each item in your series.
import pandas as pd
ser = pd.Series(index=list("WXYZ"), data=[37, 176, 43, 96], name="Value")
df = pd.DataFrame(
{
"Start": [10, 23, 89, 168, 21],
"Name": ["Blue", "Red", "Green", "Purple", "Yellow"],
"End": [28, 25, 107, 216, 40],
}
)
df_dict = df.set_index(["Start", "End"])["Name"].to_dict()
ser.apply(
lambda x: next(
(color for bounds, color in df_dict.items() if x in range(*bounds)), None
)
)
which gives
W Yellow
X Purple
Y None
Z Green
Name: Value, dtype: object
The idea is to make df_dict
a dictionary with the keys as your bounds:
{(10, 28): 'Blue',
(23, 25): 'Red',
(89, 107): 'Green',
(168, 216): 'Purple',
(21, 40): 'Yellow'}
and then loop through your series, asking whether each element x lies between the bounds, using the fact that
7 in range(3, 5)
# False
7 in range(3, 10)
# True
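One caveat with this approach: range(start, end) excludes end, so a value exactly equal to End will not match, while the other answers treat End as inclusive. If inclusive bounds are intended, a hedged variant is to compare against the bounds directly (df_dict below is the same dictionary shown above):

```python
# same bounds-to-name dictionary as built from the dataframe above
df_dict = {(10, 28): 'Blue', (23, 25): 'Red', (89, 107): 'Green',
           (168, 216): 'Purple', (21, 40): 'Yellow'}

def lookup(x):
    # inclusive on both ends, unlike `x in range(*bounds)`
    return next((color for (start, end), color in df_dict.items()
                 if start <= x <= end), None)

print(lookup(40))  # 'Yellow' -- `40 in range(21, 40)` would have missed this
print(lookup(43))  # None
```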
Upvotes: 2
Reputation: 101
As a loose solution, you can iterate over the rows and match the value. I have created a solution using this approach:
import pandas as pd
df = pd.DataFrame({'name': ['Blue', 'Red', 'Green', 'Purple', 'Yellow'],
                   'start': [10, 23, 89, 168, 21],
                   'end': [28, 25, 107, 216, 40]})
df2 = pd.DataFrame({'name': ['W', 'X', 'Y', 'Z'],
                    'value': [37, 176, 43, 96]})
contained=[]
not_contained = []
def checking(val):
    # iterate over df rows
    for index, row in df.iterrows():
        if val >= row['start'] and val <= row['end']:
            # if the value is found, append it to contained and stop
            contained.append(val)
            return True
    # if the value was not found in any range, append it to not_contained
    not_contained.append(val)
# apply function
df2['value'].apply(checking)
print("contained: ",contained)
print("not_contained: ",not_contained)
Output:
contained: [37, 176, 96]
not_contained: [43]
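Relying on apply for its side effects works, but it is a little fragile (calling checking twice would append duplicates). A sketch of the same idea returning a boolean mask instead, using the same lowercase-column frames as in this answer:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Blue', 'Red', 'Green', 'Purple', 'Yellow'],
                   'start': [10, 23, 89, 168, 21],
                   'end': [28, 25, 107, 216, 40]})
df2 = pd.DataFrame({'name': ['W', 'X', 'Y', 'Z'],
                    'value': [37, 176, 43, 96]})

def in_any_range(val):
    # True if val falls in at least one [start, end] row of df
    return ((df['start'] <= val) & (val <= df['end'])).any()

mask = df2['value'].apply(in_any_range)
contained = df2.loc[mask, 'value'].tolist()
not_contained = df2.loc[~mask, 'value'].tolist()
print(contained)      # [37, 176, 96]
print(not_contained)  # [43]
```

Keeping the function pure and splitting with a mask afterwards makes the result reproducible no matter how many times (or in what order) pandas evaluates the function.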
Upvotes: 0
Reputation: 2924
Since there is no nice BETWEEN-style join like in SQL, you can cross join the two frames on a temporary key and then filter the results.
df['key'] = 0
df2['key'] = 0
df.merge(df2, on='key', how='outer').query('(Value >= Start) & (Value <= End)').pop('Name_y').tolist()
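On pandas 1.2+, the temporary key can be skipped entirely, since merge supports how='cross' directly; a sketch re-creating the question's frames (the data here is an assumption copied from the post):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Blue', 'Red', 'Green', 'Purple', 'Yellow'],
                   'Start': [10, 23, 89, 168, 21],
                   'End': [28, 25, 107, 216, 40]})
df2 = pd.DataFrame({'Name': ['W', 'X', 'Y', 'Z'],
                    'Value': [37, 176, 43, 96]})

# cartesian product of the two frames, then keep rows where Value is in range;
# the & must be parenthesized because it binds tighter than comparisons in query()
matched = (df.merge(df2, how='cross')
             .query('(Value >= Start) & (Value <= End)'))
contained = matched['Value'].tolist()
print(sorted(contained))  # [37, 96, 176]
```

As with the temp-key version, this materializes len(df) * len(df2) rows before filtering, so it is best suited to modest frame sizes.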
Upvotes: 1