gabboshow

Reputation: 5579

intersection of 2 pandas dataframes

In my problem I have 2 dataframes, mydataframe1 and mydataframe2, as below.

mydataframe1
Out[13]:
  Start   End       Remove     
  50      60        1  
  61      105       0  
  106     150       1  
  151     160       0  
  161     180       1  
  181     200       0  
  201     400       1  


mydataframe2
Out[14]: 
    Start   End  
    55      100
    105     140
    151     154
    155     185
    220     240    

From mydataframe2 I would like to remove the rows whose Start-End interval overlaps (even partially) any of the "Remove"=1 intervals in mydataframe1. In other words, there should be no intersection between the intervals of mydataframe2 and any of the Remove=1 intervals of mydataframe1.

In this case mydataframe2 becomes:

mydataframe2
Out[15]: 
    Start   End  
    151     154
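
For reproducibility, the two dataframes can be built like this:

import pandas as pd

mydataframe1 = pd.DataFrame({
    'Start':  [50, 61, 106, 151, 161, 181, 201],
    'End':    [60, 105, 150, 160, 180, 200, 400],
    'Remove': [1, 0, 1, 0, 1, 0, 1],
})

mydataframe2 = pd.DataFrame({
    'Start': [55, 105, 151, 155, 220],
    'End':   [100, 140, 154, 185, 240],
})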

Upvotes: 3

Views: 456

Answers (6)

sammywemmy

Reputation: 28729

One option is conditional_join from pyjanitor.

Note that this uses the dev version, which is optimised:

# pip install pyjanitor
# dev version, optimised
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

(df2
.conditional_join(
    df1.loc[df1.Remove == 1],    # join only against the Remove == 1 intervals
    ('Start', 'End', '<='),      # df2.Start <= df1.End
    ('End', 'Start', '>='),      # df2.End >= df1.Start
    how='left',
    right_columns='Remove')
# rows with no match come back with Remove == NaN, i.e. they overlap nothing marked for removal
.loc[lambda df: df.Remove.isna(), df2.columns]
)
   Start  End
2    151  154
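
The two condition tuples encode the usual overlap test (df2.Start <= df1.End and df2.End >= df1.Start), and the left join followed by the .isna() filter acts as an anti-join: only the df2 rows that match none of the Remove == 1 intervals survive.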

Upvotes: 0

grepe

Reputation: 1977

How about this:

mydataframe1['key'] = 1
mydataframe2['key'] = 1

# cross join: every interval of mydataframe2 against every interval of mydataframe1
df3 = mydataframe2.merge(mydataframe1, on="key")

# two closed intervals overlap iff each one starts no later than the other ends
df3['overlaps'] = (df3.Start_y <= df3.End_x) & (df3.End_y >= df3.Start_x)

# the Start/End pairs of mydataframe2 that touch a Remove == 1 interval...
to_remove = df3[df3.overlaps & (df3.Remove == 1)][['Start_x', 'End_x']].drop_duplicates()

# ...and the rows that survive (only 151-154 here)
my_new_dataframe = mydataframe2[~mydataframe2.Start.isin(to_remove.Start_x)].drop(columns='key')

Upvotes: 0

Zero

Reputation: 77027

You could use pd.IntervalIndex for intersections

Get rows to be removed

In [313]: dfr = df1.query('Remove == 1')

Construct an IntervalIndex from the ranges to be removed

In [314]: s1 = pd.IntervalIndex.from_arrays(dfr.Start, dfr.End, 'both')

Construct an IntervalIndex from the ranges to be tested

In [315]: s2 = pd.IntervalIndex.from_arrays(df2.Start, df2.End, 'both')

Select the rows of df2 whose intervals do not fall within any of the s1 ranges

In [316]: df2.loc[[x not in s1 for x in s2]]
Out[316]:
   Start  End
2    151  154

Details

In [320]: df1
Out[320]:
   Start  End  Remove
0     50   60       1
1     61  105       0
2    106  150       1
3    151  160       0
4    161  180       1
5    181  200       0
6    201  400       1

In [321]: df2
Out[321]:
   Start  End
0     55  100
1    105  140
2    151  154
3    155  185
4    220  240

In [322]: dfr
Out[322]:
   Start  End  Remove
0     50   60       1
2    106  150       1
4    161  180       1
6    201  400       1

IntervalIndex details

In [323]: s1
Out[323]:
IntervalIndex([[50, 60], [106, 150], [161, 180], [201, 400]]
              closed='both',
              dtype='interval[int64]')

In [324]: s2
Out[324]:
IntervalIndex([[55, 100], [105, 140], [151, 154], [155, 185], [220, 240]]
              closed='both',
              dtype='interval[int64]')

In [326]: [x not in s1 for x in s2]
Out[326]: [False, False, True, False, False]
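
A caveat: the "x not in s1" membership test relies on how IntervalIndex resolved interval lookups at the time; on recent pandas versions, "in" against an IntervalIndex checks for an exact interval match rather than an overlap. A sketch of the same filter written with IntervalIndex.overlaps, reusing the s1 and s2 built above:

df2.loc[[not s1.overlaps(x).any() for x in s2]]

which again keeps only the 151-154 row.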

Upvotes: 2

MaxU - stand with Ukraine

Reputation: 210982

We can use the overlap test from the Medial- or length-oriented tree approach: with s = Start + End and d = End - Start, two intervals overlap when |s1 - s2| < d1 + d2:

In [143]: d1 = d1.assign(s=d1.Start+d1.End, d=d1.End-d1.Start)

In [144]: d2 = d2.assign(s=d2.Start+d2.End, d=d2.End-d2.Start)

In [145]: d1
Out[145]:
   Start  End  Remove    d    s
0     50   60       1   10  110
1     61  105       0   44  166
2    106  150       1   44  256
3    151  160       0    9  311
4    161  180       1   19  341
5    181  200       0   19  381
6    201  400       1  199  601

In [146]: d2
Out[146]:
   Start  End   d    s
0     55  100  45  155
1    105  140  35  245
2    151  154   3  305
3    155  185  30  340
4    220  240  20  460

Now we can check for overlapping intervals and filter:

In [148]: d2[~d2[['s','d']]\
     ...:       .apply(lambda x: ((d1.loc[d1.Remove==1, 's'] - x.s).abs() <
     ...:                         d1.loc[d1.Remove==1, 'd'] +x.d).any(),
     ...:              axis=1)]\
     ...:   .drop(['s','d'], 1)
     ...:
Out[148]:
   Start  End
2    151  154
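
As a quick sanity check of the test, take the first d2 row against the first Remove == 1 row of d1:

# d2 row [55, 100]: s = 155, d = 45;  d1 row [50, 60]: s = 110, d = 10
abs(155 - 110) < 45 + 10   # 45 < 55 -> True, the intervals overlap, so the row is dropped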

Upvotes: 1

vielkind

Reputation: 2980

You can collect all the unique values covered by the ranges marked Remove = 1, then check whether the Start/End intervals in mydataframe2 intersect any of those values. The first part below builds the set of all values falling within the Start/End ranges where Remove = 1.

import numpy as np

start_end_remove = mydataframe1[mydataframe1['Remove'] == 1][['Start', 'End']].to_numpy()
remove_ranges = set()
for x in start_end_remove:
    remove_ranges.update(np.arange(x[0], x[1] + 1))

Next, evaluate mydataframe2 against that set of values. A helper function checks whether a row's Start/End range intersects the set; applying it to each row of mydataframe2 flags the rows to remove in a new column, and those flagged rows are then dropped.

def evaluate_in_range(x, remove_ranges):
    s = x[0]
    e = x[1]
    eval_range = set(np.arange(s, e + 1))
    if len(eval_range.intersection(remove_ranges)) > 0:
        return 1
    else:
        return 0

mydataframe2['Remove'] = mydataframe2[['Start', 'End']].apply(lambda x: evaluate_in_range(x, remove_ranges), axis=1)
mydataframe2.drop(mydataframe2[mydataframe2['Remove'] == 1].index, inplace=True)
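
For example, applied to individual Start/End pairs from mydataframe2:

evaluate_in_range((55, 100), remove_ranges)   # -> 1, shares 55..60 with the [50, 60] range
evaluate_in_range((151, 154), remove_ranges)  # -> 0, intersects none of the Remove ranges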

Upvotes: 0

cwallenpoole

Reputation: 82088

I think that this should work:

mydataframe2[mydataframe2.Start.isin(mydataframe1[mydataframe1.Remove != 0].Start)]

Breaking it down:

# This mask selects the mydataframe1 rows whose Remove flag is not 0
filter_non_remove = mydataframe1.Remove != 0

# This provides a valid Sequence of Start values
valid_starts = mydataframe1[filter_non_remove].Start

# Another filter, that checks whether each Start
# value of mydataframe2 is in the valid_starts Sequence
is_df2_valid = mydataframe2.Start.isin(valid_starts)

# Final applied filter
output = mydataframe2[is_df2_valid]

Upvotes: 0
