Reputation: 571
I have a dataframe and a list of values.
vehicles = ['van','car','bike']
a | b | c |
---|---|---|
car | dog | 1 |
van | bike | 2 |
van | car | 3 |
dog | van | 4 |
To find all rows where both values are a vehicle, I could do the following:
toy_df[ ( toy_df['a'].isin(vehicles) ) & ( toy_df['b'].isin(vehicles) ) ]
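On the toy frame this keeps the two rows where both a and b hold a vehicle:
a | b | c |
---|---|---|
van | bike | 2 |
van | car | 3 |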
This works fine. However, my data is much larger, and I'm thinking that using indexes would be more efficient. If I were to set the index as a multi-index
toy_df = toy_df.set_index(['a','b'])
...how would I use those indexes to return the same result? I have tried...
filter_a = toy_df.index.isin(vehicles,level=0)
filter_b = toy_df.index.isin(vehicles,level=1)
filter_a
>> array([ True, True, True, False])
filter_b
>> array([False, True, True, True])
but I don't know how to use the filters with .loc, and I'm not sure whether this is even the most efficient approach. Any guidance would be greatly appreciated.
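For what it's worth, my best guess is that the two boolean arrays can just be combined element-wise and passed to .loc (or to plain [] indexing), along these lines, but I haven't been able to confirm that this is the intended way:
toy_df.loc[filter_a & filter_b]   # .loc accepts a boolean array of matching length
# or, equivalently, the shorthand
toy_df[filter_a & filter_b]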
Upvotes: 1
Views: 2149
Reputation: 2239
I used your initial dataframe, here called df, and ran this code:
import time
import pandas as pd

# the toy dataframe from the question
df = pd.DataFrame({'a': ['car', 'van', 'van', 'dog'],
                   'b': ['dog', 'bike', 'car', 'van'],
                   'c': [1, 2, 3, 4]})
df2 = df.set_index(['a', 'b'])
vehicles = ['van', 'car', 'bike']

def myfunc1():
    # filter on the columns
    df[(df.a.isin(vehicles)) & (df.b.isin(vehicles))]

def myfunc2():
    # filter on the two levels of the multi-index
    df2[(df2.index.isin(vehicles, level=0)) & (df2.index.isin(vehicles, level=1))]

n = 1000
for test in range(5):
    t0 = time.time()
    for i in range(n): myfunc1()
    t1 = time.time()
    t2 = time.time()
    for i in range(n): myfunc2()
    t3 = time.time()
    total_1 = t1 - t0
    total_2 = t3 - t2
    print(test, ":", total_1, total_2)
In the printout you can see that filtering on the index takes less than half the time it takes to filter on the columns.
0 : 0.8234035968780518 0.37520408630371094
1 : 0.7863156795501709 0.3657698631286621
2 : 0.7700819969177246 0.36788105964660645
3 : 0.7782289981842041 0.4089479446411133
4 : 0.8350069522857666 0.38277411460876465
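As an untimed sanity check, both approaches should select the same rows; assuming the df, df2 and vehicles from the snippet above (the mask names here are just for illustration):
mask_cols = (df.a.isin(vehicles)) & (df.b.isin(vehicles))
mask_idx = (df2.index.isin(vehicles, level=0)) & (df2.index.isin(vehicles, level=1))
# moving the index levels back into columns makes the two results directly comparable
print(df[mask_cols].reset_index(drop=True).equals(df2[mask_idx].reset_index()))
# expected output: True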
Upvotes: 3