James Allen-Robertson
James Allen-Robertson

Reputation: 571

Pandas Multi-index .isin ALL index columns

I have a dataframe and a list of values.

vehicles = ['van','car','bike']

a b c
car dog 1
van bike 2
van car 3
dog van 4

To find all rows where both values are a vehicle I could to the following

toy_df[ ( toy_df['a'].isin(vehicles) ) & ( toy_df['b'].isin(vehicles) ) ]

... and this works fine. However my data is much larger, and I'm thinking using indexes would be more efficient. If I were to set the index as a multi-index

toy_df = toy_df.set_index(['a','b'])

...how would I use those indexes to return the same result? I have tried...

filter_a = toy_df.index.isin(vehicles,level=0)
filter_b = toy_df.index.isin(vehicles,level=1)

filter_a
>>  array([ True,  True,  True, False])

filter_b
>> array([False,  True,  True,  True])

but I don't know how to use the filters with .loc and I'm not sure if this is even the most efficient approach. Any guidance would be greatly appreciated.

Upvotes: 1

Views: 2149

Answers (1)

Jorge
Jorge

Reputation: 2239

I used your initial dataframe, here as df, and run this code:

import time

df2 = df.set_index(['a', 'b'])

vehicles = ['van','car','bike']
for test in range(5):
    def myfunc1():
        df[(df.a.isin(vehicles))|(df.b.isin(vehicles))]

    def myfunc2():
        df2[(df2.index.isin(vehicles, level = 0))&(df2.index.isin(vehicles, level = 1))]

    n = 1000
    t0 = time.time()
    for i in range(n): myfunc1()
    t1 = time.time()

    t2 = time.time()
    for i in range(n): myfunc2()
    t3 = time.time()


    total_1 = t1-t0
    total_2 = t3-t2

    print(test, ":", total_1, total_2)

In the print out, you can see that using the filter on the filer takes less than half the time it takes to filter the columns.

0 : 0.8234035968780518 0.37520408630371094
1 : 0.7863156795501709 0.3657698631286621
2 : 0.7700819969177246 0.36788105964660645
3 : 0.7782289981842041 0.4089479446411133
4 : 0.8350069522857666 0.38277411460876465

Upvotes: 3

Related Questions