Iacopo Passeri
Iacopo Passeri

Reputation: 145

Python: how to check if a list of values is contained within a range

I'm working with a list of values structured as dataframe and I'd like to compare each value of this list with another dataframe (kinda like this one below):

Name    Start    End
Blue    10       28
Red     23       25
Green   89       107
Purple  168      216
Yellow  21       40

Now let's suppose that the list of values is something like this:

Name    Value
W       37
X       176
Y       43
Z       96

For each element in the "Value" column I'd like to check if that value is contained inside each "Start" - "End" range of the first dataframe and add it to a list (i.e. contained = []). In the example W (37) is contained in Yellow, X (176) in Purple, Z (96) in Green while Y is not matching so it will be excluded (or added to another list not_contained = []). How can I do that? Thank you all.

Upvotes: 2

Views: 2014

Answers (4)

sammywemmy
sammywemmy

Reputation: 28729

You can use conditional_join from pyjanitor to simulate a range join, like in SQL -> note that this is in dev:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd

result = (df2.select_columns('Value')
             .conditional_join(df1.select_columns('Start', 'End'), 
                               ('Value', 'Start', '>='), 
                               ('Value', 'End', '<='), 
                               how = 'left')
           )
 
   Value  Start    End
0     37   21.0   40.0
1    176  168.0  216.0
2     43    NaN    NaN
3     96   89.0  107.0

Filter into contained and not_contained based on null values in either Start or End:

contained = result.Value[result.Start.notna()].to_list()
not_contained = result.Value[~result.Start.notna()].to_list()

print(contained)
[37, 176, 96]

print(not_contained)

This is overkill though, and not really efficient, in my opinion; you do not need the dataframe, you just need some lists; a much simpler way would be to use Pandas' intervals, coupled with a for-loop; for large dataframes, I expect this to be more efficient:

# create intervalindex
intervals = pd.IntervalIndex.from_arrays(df1.Start, df1.End, closed = 'both')

# We could have used pandas get_indexer, 
# however, the intervals are overlapping,
# and get_indexer works only with non-overlapping/unique indexes

contained = []
not_contained = []

for _, value in df2.Value.items():
    if intervals.contains(value).any():
        contained.append(value)
    else:
        not_contained.append(value)

print(contained)
[37, 176, 96]

print(not_contained)
[43]

Upvotes: 1

user1717828
user1717828

Reputation: 7223

You can turn your dataframe into a dictionary, then do a lookup for each item in your series.

import pandas as pd

ser = pd.Series(index=list("WXYZ"), data=[37, 176, 43, 96], name="Value")

df = pd.DataFrame(
    {
        "Start": [10, 23, 89, 168, 21],
        "Name": ["Blue", "Red", "Green", "Purple", "Yellow"],
        "End": [28, 25, 107, 216, 40],
    }
)
df_dict = df.set_index(["Start", "End"])["Name"].to_dict()

ser.apply(
    lambda x: next(
        (color for bounds, color in df_dict.items() if x in range(*bounds)), None
    )
)

which gives

W    Yellow
X    Purple
Y      None
Z     Green
Name: Value, dtype: object

The idea is to make df_dict a dictionary with the keys as your bounds:

{(10, 28): 'Blue',
 (23, 25): 'Red',
 (89, 107): 'Green',
 (168, 216): 'Purple',
 (21, 40): 'Yellow'}

and then loop through your series asking if each element x is in between the bounds, using the fact that

`

7 in range(3,5)
# False

7 in range(3,10)
# True

Upvotes: 2

Gurjot Singh Mahi
Gurjot Singh Mahi

Reputation: 101

As a loose solution, you can iterate over a row and match the value. I have created a solution using the mentioned approach. import pandas as pd

df = pd.DataFrame({'name': ['Blue', 'Red', 'Green', 'Purple', 'Yellow'],
                   'start':[10,23,89,168,21],
                   'end':[28,25,107,216,40]})

df2 = pd.DataFrame({'name':['W','X','Y','Z'],
                    'value':[37,176,43,96]})

contained=[]
not_contained = []

def checking(val):
    # iterate over df rows
    for index, row in df.iterrows():
        if val >= row['start'] and val <= row['end']:
            # if value is found, append to contained list and return
            contained.append(val)
            return True
    # if value is not found, append it to not_contained list
    not_contained.append(val)

# apply function
df2['value'].apply(checking)

print("contained: ",contained)
print("not_contained: ",not_contained)

Output:

contained:  [37, 176, 96]
not_contained:  [43]

Upvotes: 0

Quixotic22
Quixotic22

Reputation: 2924

Since there is no nice between operator join like in sql or whatever you could cross join them on a temp key and then filter the results.

df['key'] = 0
df2['key'] = 0

df.merge(df2, on='key', how='outer').query('Value >= Start & Value <= End').pop('Name_y').tolist()

Upvotes: 1

Related Questions