max bottemanne
max bottemanne

Reputation: 21

How to find overlapping intervalIndex pandas

I have a dataframe containing lower and upper bounds of index, with a corresponding value, and I am trying to match the value of te corresponding index. My dataframe consists of 400.000 rows. an example of my dataframe:

df1 = pd.DataFrame({'low':[4,7,8],'high':[6,7,21],'value':[10,15,20]})
df2 = pd.DataFrame({'index':[4,5,6,7,8,9]})

Output:

  low   high  value
0  4     6     10
1  7     7     15
2  8     21    20

Now I want to add value of df1 to df2, whenever the index is between low and high, resulting in the following:

  index  value
0  4      10 
1  5      10
2  6      10
3  7      15
4  8      20
5  9      20

I tried to make an intervalIndex using the following post: Searching a particular value in a range among two columns python dataframe

v = df1.loc[:, 'low':'high'].apply(tuple, 1).tolist()
idx = pd.IntervalIndex.from_tuples(v, 'both')
df2['value'] = df1.iloc[idx.get_indexer(df2['index'].values), 'value'].values

But my intervals keep overlapping, I removed duplicated versions, but I still need to remove some overlapping intervals. One way to find this was using a for loop with:

[idx.overlaps(x) for x in idx]

But this takes a lot of time and everytime my memory fails. Is there some fast way to find the overlapping interval?

Upvotes: 2

Views: 817

Answers (4)

sammywemmy
sammywemmy

Reputation: 28659

An efficient option that scales well is conditional_join - it handles inequality joins efficiently:

# pip install pyjanitor
import janitor

(df2
.conditional_join(
    df1, 
    ('index', 'low', '>='), 
    ('index', 'high', '<='), 
    df_columns='index', 
    right_columns='value')
)
   index  value
0      4     10
1      5     10
2      6     10
3      7     15
4      8     20
5      9     20

Upvotes: 0

Kaveh Pouran
Kaveh Pouran

Reputation: 36

It's been a while since this thread has been opened. But I've recently had a similar problem. I found that was solved quite elegantly using Pandas IntervalIndex.

# Create an open IntervalIndex with both ends closed using setting 'both'
interval_idx = pd.IntervalIndex.from_arrays(df1.low, df1.high, 'both')

# Set the interval
df1 = df1.set_index(interval_idx)

# Define a function for getting all matching intervals from a dataframe
def get_interval_value(x, df=None):
    if df is not None:
        try:
            res = df.loc[x].value
        except:
            res = None
    else:
        res = None
    return res

# Compute interval overlaps
df2['value'] = df2['index'].apply(get_interval_value, df=df1)

Which returns the desired outcome of df2:

   index  value
0  4      10   
1  5      10   
2  6      10   
3  7      15   
4  8      20   
5  9      20   

Upvotes: 1

Rob Raymond
Rob Raymond

Reputation: 31146

Build a dataframe to join to using pd.concat()

df1 = pd.DataFrame({'low':[4,7,8],'high':[6,7,21],'value':[10,15,20]})
df2 = pd.DataFrame({'index':[4,5,6,7,8,9]}).set_index("index")

df2 = df2.join(pd.concat([pd.DataFrame(index=pd.RangeIndex(r[0],r[1]+1)).assign(value=r[2]) 
                    for r in df1.values])
         )

output

       value
index       
4         10
5         10
6         10
7         15
8         20
9         20

Upvotes: 1

sushanth
sushanth

Reputation: 8302

Here is a solution assuming index is sorted, create a dict with limit's and use dict against the dataframe then use ffill() to fill the gaps created by map.

between_ = (
    {**df1.set_index('low')['value'].to_dict(),
     **df1.set_index('high')['value'].to_dict()}
)
# {4: 10, 7: 15, 8: 20, 6: 10, 21: 20}

df2['index'].map(between_).ffill()

0    10.0
1    10.0
2    10.0
3    15.0
4    20.0
5    20.0
Name: index, dtype: float64

Upvotes: 1

Related Questions