Reputation: 5579
in my problem I have 2 dataframes mydataframe1
and mydataframe2
as below.
mydataframe1
Out[13]:
Start End Remove
50 60 1
61 105 0
106 150 1
151 160 0
161 180 1
181 200 0
201 400 1
mydataframe2
Out[14]:
Start End
55 100
105 140
151 154
155 185
220 240
From mydataframe2
I would like to remove the rows for which the interval Start-End are contained (also partially) in any of the "Remove"
=1 intervals in mydataframe1
. In other words there should not be any itnersection between the intervals of mydataframe2
and each of the intervals in mydataframe1
in this case mydataframe2 becomes
mydataframe2
Out[15]:
Start End
151 154
Upvotes: 3
Views: 456
Reputation: 28729
One option is with conditional_join
Note that this uses the dev version, which is optimised:
# pip install pyjanitor
# dev version, optimised
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df2
.conditional_join(
df1.loc[df1.Remove==1],
('Start', 'End', '<='),
('End', 'Start', '>='),
how='left',
right_columns='Remove')
.loc[lambda df: df.Remove.isna(), df2.columns]
)
Start End
2 151 154
Upvotes: 0
Reputation: 1977
How about this:
mydataframe1['key']=1
mydataframe2['key']=1
df3 = mydataframe2.merge(mydataframe1, on="key")
df3['s_gt_s'] = df3.Start_y > df3.Start_x
df3['s_lt_e'] = df3.Start_y < df3.End_x
df3['e_gt_s'] = df3.End_y > df3.Start_x
df3['e_lt_e'] = df3.End_y < df3.End_x
df3['s_in'] = df3.s_gt_s & df3.s_lt_e
df3['e_in'] = df3.e_gt_s & df3.e_lt_e
df3['overlaps'] = df3.s_in | df3.e_in
my_new_dataframe = df3[df3.overlaps & df3.Remove==1][['End_x','Start_x']].drop_duplicates()
Upvotes: 0
Reputation: 77027
You could use pd.IntervalIndex
for intersections
Get rows to be removed
In [313]: dfr = df1.query('Remove == 1')
Construct IntervalIndex from to be removed ranges
In [314]: s1 = pd.IntervalIndex.from_arrays(dfr.Start, dfr.End, 'both')
Construct IntervalIndex from to be tested
In [315]: s2 = pd.IntervalIndex.from_arrays(df2.Start, df2.End, 'both')
Select rows of s2 which are not in s1 ranges
In [316]: df2.loc[[x not in s1 for x in s2]]
Out[316]:
Start End
2 151 154
Details
In [320]: df1
Out[320]:
Start End Remove
0 50 60 1
1 61 105 0
2 106 150 1
3 151 160 0
4 161 180 1
5 181 200 0
6 201 400 1
In [321]: df2
Out[321]:
Start End
0 55 100
1 105 140
2 151 154
3 155 185
4 220 240
In [322]: dfr
Out[322]:
Start End Remove
0 50 60 1
2 106 150 1
4 161 180 1
6 201 400 1
IntervalIndex details
In [323]: s1
Out[323]:
IntervalIndex([[50, 60], [106, 150], [161, 180], [201, 400]]
closed='both',
dtype='interval[int64]')
In [324]: s2
Out[324]:
IntervalIndex([[55, 100], [105, 140], [151, 154], [155, 185], [220, 240]]
closed='both',
dtype='interval[int64]')
In [326]: [x not in s1 for x in s2]
Out[326]: [False, False, True, False, False]
Upvotes: 2
Reputation: 210982
We can use Medial- or length-oriented tree: Overlap test:
In [143]: d1 = d1.assign(s=d1.Start+d1.End, d=d1.End-d1.Start)
In [144]: d2 = d2.assign(s=d2.Start+d2.End, d=d2.End-d2.Start)
In [145]: d1
Out[145]:
Start End Remove d s
0 50 60 1 10 110
1 61 105 0 44 166
2 106 150 1 44 256
3 151 160 0 9 311
4 161 180 1 19 341
5 181 200 0 19 381
6 201 400 1 199 601
In [146]: d2
Out[146]:
Start End d s
0 55 100 45 155
1 105 140 35 245
2 151 154 3 305
3 155 185 30 340
4 220 240 20 460
now we can check for overlapping intervals and filter:
In [148]: d2[~d2[['s','d']]\
...: .apply(lambda x: ((d1.loc[d1.Remove==1, 's'] - x.s).abs() <
...: d1.loc[d1.Remove==1, 'd'] +x.d).any(),
...: axis=1)]\
...: .drop(['s','d'], 1)
...:
Out[148]:
Start End
2 151 154
Upvotes: 1
Reputation: 2980
You can get all the unique range values from the columns marked Remove
then evaluate the Start
and End
dates contained in mydataframe2
are not in any of the range values. The first part will define all unique values falling with the Start/End values were Remove = 1.
start_end_remove = mydataframe1[mydataframe1['Remove'] == 1][['Start', 'End']].as_matrix()
remove_ranges = set([])
for x in start_end_remove:
remove_ranges.update(np.arange(x[0], x[1] + 1))
Next you can evaluate mydataframe2
against the unique set of range values. If the Start/End values of mydataframe2
are in the range of values they are removed from the dataframe by flagging whether they should be removed in a new columns. A function is defined to see if there is overlap between any of the ranges, then that function is applied to each row in mydataframe2
and remove the rows where the ranges do overlap.
def evaluate_in_range(x, remove_ranges):
s = x[0]
e = x[1]
eval_range = set(np.arange(s, e + 1))
if len(eval_range.intersection(remove_ranges)) > 0:
return 1
else:
return 0
mydataframe2['Remove'] = mydataframe2[['Start', 'End']].apply(lambda x: evaluate_in_range(x, remove_ranges), axis=1)
mydataframe2.drop(mydataframe2[mydataframe2['Remove'] == 1].index, inplace=True)
Upvotes: 0
Reputation: 82088
I think that this should work:
mydataframe2[mydataframe2.Start.isin(mydataframe1[mydataframe1.Remove != 0].Start)]
Breaking it down:
# This filter will remove anything which has Remove not 0
filter_non_remove = mydataframe1.Remove != 0
# This provides a valid Sequence of Start values
valid_starts = mydataframe1[mydataframe1.Remove != 0].Start
# Another filter, that checks whether the Start
# value is in the valid_starts Sequence
is_df2_valid = mydataframe2.Start.isin(valid_starts)
# Final applied filter
output = mydataframe2[is_df2_valid]
Upvotes: 0