Reputation: 5759
I'm guessing there is a more efficient way to make the comparisons that I am making.
Currently, I have two pandas DataFrames.
DataFrame A
looks like this:
Location Tier Other
0 100 1 'Blah'
1 200 1 'Blah'
2 10 1 'Blah'
3 30 1 'Blah'
4 500 1 'Blah'
DataFrame B
looks like this:
Start Stop Tier Other
0 400 600 1 'Blah'
1 5 20 2 'Blah'
I would like to find all rows in DataFrame A whose Location is greater than Start, less than Stop, and whose Tier matches. So, in the above example, row 4 from DataFrame A has a Location that is greater than 400 but less than 600, and the Tier is 1 in both DataFrames, so it should get returned somehow, like appending to a final DataFrame.
This is how I am making the comparisons now:
def find_matches(A, B):
    matching = pd.DataFrame()
    for i in A.itertuples():
        matching = matching.append(B[(B.Tier == i.Tier) & (B.Start < i.Location) & (B.Stop > i.Location)], ignore_index=True)
    return matching
Is there a faster way of accomplishing this, as my code runs quite slowly?
Upvotes: 0
Views: 62
Reputation: 872
You need to decide how to define Start and Stop when there are multiple rows with the same Tier in B. One option is to use the first matching row:
def matching(row):
    # use the first row in B with this Tier
    cur_row = B[B.Tier == row.Tier].iloc[0]
    Start = cur_row.Start
    Stop = cur_row.Stop
    return row.Location > Start and row.Location < Stop

A[A.apply(matching, axis=1)]
Another option is to use the widest range across all matching rows:
def matching(row):
    # use the widest Start/Stop range for this Tier
    cur_rows = B[B.Tier == row.Tier]
    Start = cur_rows.Start.min()
    Stop = cur_rows.Stop.max()
    return row.Location > Start and row.Location < Stop

A[A.apply(matching, axis=1)]
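A fully vectorized alternative (a sketch, with hypothetical sample data shaped like the question's frames): merge A and B on Tier to pair every A row with every B row of the same Tier, then filter on the interval condition.

```python
import pandas as pd

# Hypothetical frames mirroring the question's A and B
A = pd.DataFrame({"Location": [100, 200, 10, 30, 500], "Tier": [1, 1, 1, 1, 1]})
B = pd.DataFrame({"Start": [400, 5], "Stop": [600, 20], "Tier": [1, 2]})

# Pair rows that share a Tier, then keep only those inside the interval
merged = A.merge(B, on="Tier")
result = merged[(merged.Location > merged.Start) & (merged.Location < merged.Stop)]
```

This avoids the per-row apply entirely, at the cost of materializing all Tier-matched pairs before filtering.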
Upvotes: 1
Reputation: 323366
Using numpy broadcasting:
s1=df2.Start.values<df.Location.values[:,None]
s2=df2.Stop.values>df.Location.values[:,None]
s1&s2
Out[110]:
array([[False, False],
[False, False],
[False, True],
[False, False],
[ True, False]])
df[(s1&s2).any(1)]
Out[111]:
Location Tier Other
2 10 1 'Blah'
4 500 1 'Blah'
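Note that the mask above checks only the interval, not the Tier. A third broadcast comparison can add that condition (a sketch, rebuilding df and df2 from the question's sample data):

```python
import numpy as np
import pandas as pd

# df plays the role of A, df2 the role of B, as in the answer above
df = pd.DataFrame({"Location": [100, 200, 10, 30, 500], "Tier": [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({"Start": [400, 5], "Stop": [600, 20], "Tier": [1, 2]})

s1 = df2.Start.values < df.Location.values[:, None]
s2 = df2.Stop.values > df.Location.values[:, None]
s3 = df2.Tier.values == df.Tier.values[:, None]   # additional Tier condition
result = df[(s1 & s2 & s3).any(1)]
```

With the Tier check included, row 2 (Location 10) drops out, since its only interval match in df2 has Tier 2.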
Upvotes: 1
Reputation: 134
The first thing that comes to my mind is sorting your data sets (for example, sort by Location and then by Tier for set A). Then you can use a binary search algorithm to vastly improve your search time.
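The idea can be sketched with np.searchsorted, assuming the intervals within each Tier do not overlap (match_locations and the sample frames are hypothetical, not from the question):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"Location": [100, 200, 10, 30, 500], "Tier": [1, 1, 1, 1, 1]})
B = pd.DataFrame({"Start": [400, 5], "Stop": [600, 20], "Tier": [1, 2]})

def match_locations(A, B):
    keep = np.zeros(len(A), dtype=bool)
    locs = A.Location.values
    # binary-search each Tier's sorted Start values instead of scanning all of B
    for tier, b in B.sort_values("Start").groupby("Tier"):
        starts, stops = b.Start.values, b.Stop.values
        in_tier = A.Tier.values == tier
        # index of the last interval whose Start is strictly less than Location
        idx = np.searchsorted(starts, locs, side="left") - 1
        ok = (idx >= 0) & (locs < stops[np.clip(idx, 0, None)])
        keep |= in_tier & ok
    return A[keep]
```

Each lookup is O(log n) per Tier instead of a full scan, which is where the speedup comes from.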
Upvotes: 1