Ivan Shelonik

Reputation: 2028

Filter simultaneously by different row values in Pandas

I have a huge dataframe with product_id and their property_id's. Note that each property sits on its own row (its own index). I need to filter simultaneously by several property_id values for each product_id. Is there any way to do this fast?

out_df

product_id  property_id
0   3588    1
1   3588    2
2   3588    5
3   3589    1
4   3589    3
5   3589    5
6   3590    1
7   3590    2
8   3590    5

For example, I want to filter each product_id by two properties that are assigned on different rows, something like out_df.loc[(out_df['property_id'] == 1) & (out_df['property_id'] == 2)] (which of course returns nothing, because no single row can match both conditions). I need something like that, but working at the same time across all rows of each product_id.

I know that it can be done via groupby into lists

3588    [1, 2, 5]
3589    [1, 3, 5]
3590    [1, 2, 5]

and finding intersections inside lists.

gp_df = out_df.groupby('product_id').agg({'property_id': list})
gp_df.apply(lambda r: {1, 2} <= set(r['property_id']), axis=1)  # True where the product has both properties

But it takes time, whereas Pandas' ordinary column filtering is heavily optimized for speed (presumably using clever inverted indexes internally, much like search engines such as Elasticsearch or Sphinx do). Something along the lines of the sketch below is what I am hoping for.
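To show the sort of vectorized approach I have in mind (just a rough sketch of my own, names invented and not benchmarked):

wanted = {1, 2}
hits = out_df[out_df['property_id'].isin(wanted)]             # rows having any wanted property
counts = hits.groupby('product_id')['property_id'].nunique()  # distinct wanted properties per product
good_ids = counts[counts == len(wanted)].index                # products that have all of them
result = out_df[out_df['product_id'].isin(good_ids)]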

Expected output: only the product_ids that have both 1 and 2.

3588    [1, 2, 5]
3590    [1, 2, 5]

Upvotes: 0

Views: 83

Answers (3)

chjortlund

Reputation: 4027

Since this is just as much a performance question as a functional one, I would go with an intersection approach like this:

import pandas as pd
from time import time

df = pd.DataFrame({'product_id': [3588, 3588, 3588, 3589, 3589, 3589, 3590, 3590, 3590],
                   'property_id': [1, 2, 5, 1, 3, 5, 1, 2, 5]})

# Index on property_id so each lookup is a fast label-based selection
df = df.set_index(['property_id'])

print("The full DataFrame:")
print(df)

start = time()

for i in range(1000):
    s1 = df.loc[1, 'product_id']
    s2 = df.loc[2, 'product_id']

    # product_ids that have both property 1 and property 2
    s_done = pd.Series(list(set(s1).intersection(set(s2))))

print("Overlapping product_id's")
print(time() - start)

Iterating the lookup 1000 times takes 0.93 seconds on my ThinkPad T450s. I took the liberty of testing @jezrael's two suggestions, and they come in at 2.11 and 2.00 seconds; the groupby approach is more elegant from a software-engineering standpoint, though.

Depending on the size of your dataset and the importance of performance, you can also switch to simpler data types, like classic dictionaries, and gain further speed, as sketched below.
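A minimal sketch of the dictionary variant (my own illustration, assuming df as defined above but before the set_index call):

from collections import defaultdict

# Build an inverted index once: property_id -> set of product_ids
prop_to_products = defaultdict(set)
for prod, prop in zip(df['product_id'], df['property_id']):
    prop_to_products[prop].add(prod)

# Each lookup is then just a set intersection
both = prop_to_products[1] & prop_to_products[2]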


The Jupyter Notebook can be found here: pandas_fast_lookup_using_intersection.ipynb

Upvotes: 1

jezrael

Reputation: 863301

The simplest approach is to use GroupBy.transform and compare sets:

s = {1, 2}
# keep only products whose full set of properties contains every value in s
a = df[df.groupby('product_id')['property_id'].transform(lambda r: s <= set(r))]
print (a)
   product_id  property_id
0        3588            1
1        3588            2
2        3588            5
6        3590            1
7        3590            2
8        3590            5

Another solution is to filter only the values in the set, removing duplicates first:

df1 = df[df['property_id'].isin(s) & ~df.duplicated(['product_id', 'property_id'])]

Then it is necessary to check whether the length of each group equals the length of the set:

import numpy as np

f, u = df1['product_id'].factorize()                               # integer code per product_id
ids = df1.loc[np.bincount(f)[f] == len(s), 'product_id'].unique()  # products that appear len(s) times

Last, filter all rows whose product_id meets the condition:

a = df[df['product_id'].isin(ids)]
print (a)
   product_id  property_id
0        3588            1
1        3588            2
2        3588            5
6        3590            1
7        3590            2
8        3590            5

Upvotes: 1

freddy888

Reputation: 1010

Do you mean something like this?

result = out_df.loc[out_df['property_id'].isin([1,2]), :]

If you want, you can then drop duplicates based on product_id, for example as sketched below...
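A minimal sketch of that last step (assuming you want to keep one row per matching product):

deduped = result.drop_duplicates(subset='product_id')  # first row of each product_id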

Upvotes: 1
