Manap Shymyr
Manap Shymyr

Reputation: 149

Is there any possibility to filter pandas by list?

I have a df with columns product_id and keyword. I am iterating over product id by for group_name, group in s.groupby('product_id') and in this step, I want to keep groups that contain all elements from the search list. for example: enter image description here

and my search list = ['land', 'cruiser']

so in the output, I want this:

enter image description here

I have tried this

data[data["keywords"].isin(search_words)]

but this method keeps all groups if it contains at least one element from the list, while I groups must contain all elements in the search list.

Upvotes: 0

Views: 732

Answers (2)

Rawson
Rawson

Reputation: 2787

For a different answer (in one line!):

import pandas as pd
s = pd.DataFrame({'product_id': [1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4],
                  'keywords': ["land", "cruiser", "land", "cruiser", "land",
                               "land", "land", "cruiser", "cruiser", "land", "cruiser"]})

search_words = ['land', 'cruiser']

s[[(all(p in x for p in search_words)) \
   for x in s.merge(s.groupby("product_id").apply(
           lambda x: list(set(x["keywords"]))).to_frame(). \
               rename(columns={0: "grouped"}), left_on="product_id",
               right_index=True)["grouped"]]]
#Out: 
#    product_id keywords
#0            1     land
#1            1  cruiser
#2            1     land
#3            1  cruiser
#4            1     land
#9            4     land
#10           4  cruiser

EDIT

I have just used pandas groupby filtration, and realised it would be much more simple to use this here instead. The returned dataframe is exactly the same as the one above, just using a lot less code!

s.groupby("product_id").filter(lambda x: all(p in list(set(x["keywords"])) for p in search_words))

Upvotes: 1

Levin
Levin

Reputation: 2005

The core logic here is lifted from S Rawson's answer.

Instead of a one-liner, I create some helper variables -- I make a series of product_id and all their associated search terms (product_keywords). Then I make a series (keepers) of all the product_id that have all the keywords. Finally I use the isin() method to test if dataframe rows have a product_id in that list.

This keeps all terms that are in groups that have all the sought terms; so if group 1 has "land" and "cruiser" and also other words, this gets all of the terms.

df = pd.DataFrame({"product_id": [1,1, 2, 3, 1, 2, 3],
                   "keywords": ["land", "barn", "land", "land", "cruiser", "barn", "cruiser"]})

search_list = set(['land', 'cruiser'])

product_keywords = df.groupby("product_id").apply(lambda x: list(set(x["keywords"])))

product_keywords
# product_id
# 1    [barn, cruiser, land]
# 2             [barn, land]
# 3          [cruiser, land]

keepers = product_keywords[product_keywords.apply(lambda found_terms: all(search_term in found_terms for search_term in search_list))]

keepers
# product_id
# 1    [barn, cruiser, land]
# 3          [cruiser, land]


df.loc[df['product_id'].isin(keepers.index)]  
#    product_id keywords
# 0           1     land
# 1           1     barn
# 3           3     land
# 4           1  cruiser
# 6           3  cruiser

Upvotes: 1

Related Questions