Reputation: 21
I have a dataframe containing lower and upper bounds of index, with a corresponding value, and I am trying to match the value of te corresponding index. My dataframe consists of 400.000 rows. an example of my dataframe:
df1 = pd.DataFrame({'low':[4,7,8],'high':[6,7,21],'value':[10,15,20]})
df2 = pd.DataFrame({'index':[4,5,6,7,8,9]})
Output:
low high value
0 4 6 10
1 7 7 15
2 8 21 20
Now I want to add value of df1 to df2, whenever the index is between low and high, resulting in the following:
index value
0 4 10
1 5 10
2 6 10
3 7 15
4 8 20
5 9 20
I tried to make an intervalIndex using the following post: Searching a particular value in a range among two columns python dataframe
v = df1.loc[:, 'low':'high'].apply(tuple, 1).tolist()
idx = pd.IntervalIndex.from_tuples(v, 'both')
df2['value'] = df1.iloc[idx.get_indexer(df2['index'].values), 'value'].values
But my intervals keep overlapping, I removed duplicated versions, but I still need to remove some overlapping intervals. One way to find this was using a for loop with:
[idx.overlaps(x) for x in idx]
But this takes a lot of time and everytime my memory fails. Is there some fast way to find the overlapping interval?
Upvotes: 2
Views: 817
Reputation: 28659
An efficient option that scales well is conditional_join - it handles inequality joins efficiently:
# pip install pyjanitor
import janitor
(df2
.conditional_join(
df1,
('index', 'low', '>='),
('index', 'high', '<='),
df_columns='index',
right_columns='value')
)
index value
0 4 10
1 5 10
2 6 10
3 7 15
4 8 20
5 9 20
Upvotes: 0
Reputation: 36
It's been a while since this thread has been opened. But I've recently had a similar problem. I found that was solved quite elegantly using Pandas IntervalIndex.
# Create an open IntervalIndex with both ends closed using setting 'both'
interval_idx = pd.IntervalIndex.from_arrays(df1.low, df1.high, 'both')
# Set the interval
df1 = df1.set_index(interval_idx)
# Define a function for getting all matching intervals from a dataframe
def get_interval_value(x, df=None):
if df is not None:
try:
res = df.loc[x].value
except:
res = None
else:
res = None
return res
# Compute interval overlaps
df2['value'] = df2['index'].apply(get_interval_value, df=df1)
Which returns the desired outcome of df2
:
index value
0 4 10
1 5 10
2 6 10
3 7 15
4 8 20
5 9 20
Upvotes: 1
Reputation: 31146
Build a dataframe to join to using pd.concat()
df1 = pd.DataFrame({'low':[4,7,8],'high':[6,7,21],'value':[10,15,20]})
df2 = pd.DataFrame({'index':[4,5,6,7,8,9]}).set_index("index")
df2 = df2.join(pd.concat([pd.DataFrame(index=pd.RangeIndex(r[0],r[1]+1)).assign(value=r[2])
for r in df1.values])
)
output
value
index
4 10
5 10
6 10
7 15
8 20
9 20
Upvotes: 1
Reputation: 8302
Here is a solution assuming index is sorted, create a dict
with limit's and use dict against the dataframe then use ffill()
to fill the gaps created by map.
between_ = (
{**df1.set_index('low')['value'].to_dict(),
**df1.set_index('high')['value'].to_dict()}
)
# {4: 10, 7: 15, 8: 20, 6: 10, 21: 20}
df2['index'].map(between_).ffill()
0 10.0
1 10.0
2 10.0
3 15.0
4 20.0
5 20.0
Name: index, dtype: float64
Upvotes: 1