Salil Sharma
Salil Sharma

Reputation: 79

Efficient and faster implementation of finding and matching unique values in a pandas dataframe

Regarding the following Pandas dataframe,

idx = pd.MultiIndex.from_product([['A001', 'B001','C001'],
                                  ['0', '1', '2']],
                                 names=['ID', 'Entries'])
col = ['A', 'B']

df = pd.DataFrame('-', idx, col)
df.loc['A001', 'A'] = [10,10,10]
df.loc['A001', 'B'] = [90,84,70]
df.loc['B001', 'A'] = [10,20,30]
df.loc['B001', 'B'] = [70,86,67]
df.loc['C001', 'A'] = [20,20,20]
df.loc['C001', 'B'] = [98,81,72]
df.loc['D001', 'A'] = [20,20,10]
df.loc['D001', 'B'] = [68,71,92]
#df is a dataframe
df

I am interested to know the Ids which include the all the values from a set or list in their 'A' column. Let's define a list with values as [10,20]. In this case, I should get locations 'B001' and 'D001' as the answer since both these locations have the values mentioned the list in their 'A' column . Further can you suggest a faster implementation since I have to work on really big data set.

Upvotes: 1

Views: 68

Answers (2)

jpp
jpp

Reputation: 164643

You can use set.intersection for your calculation, and pd.Index.get_level_values to extract the first level of your index:

search = {10, 20}

idx = (set(df[df['A'] == i].index.get_level_values(0)) for i in search)

res = set.intersection(*idx)

Upvotes: 3

Vivek Kalyanarangan
Vivek Kalyanarangan

Reputation: 9081

Basically -

search_list = {10,20}
op = df.groupby(level=0)['A'].apply(lambda x: search_list.issubset(set(x))).reset_index()
print(op[op['A']]['ID'])

Thanks @Ben.T for taking out the unnecessary unique()

Output

1    B001
Name: ID, dtype: object

Explanation

df.groupby(level=0)['A'] groups by level 0 and gives you the lists -

ID
A001            [10]
B001    [10, 20, 30]
C001            [20]

Next, for each of these lists, we convert it into a set and check whether the search_list is a subset.

ID
A001    False
B001     True
C001    False

It returns a Series of boolean values which can then be used as a mask -

print(op[op['A']]['ID'])

Final Output -

1    B001

Upvotes: 1

Related Questions