Florian

Reputation: 25395

Filter dataframe matching all values of a vector in Python

I am trying to solve this question with Python.

import numpy as np
import pandas as pd

ID = np.concatenate((np.repeat("A", 5),
                     np.repeat("B", 4),
                     np.repeat("C", 2)))
Hour = np.array([0, 2, 5, 6, 9, 0, 2, 5, 6, 0, 2])
testVector = [0, 2, 5]
df = pd.DataFrame({'ID': ID, 'Hour': Hour})

We group the rows by ID, then remove all rows belonging to groups whose Hour column does not contain every value in testVector. We could achieve that as follows:

def all_in(x, y):
    # True if every value in y occurs in x
    return all(z in list(x) for z in y)

to_keep = df.groupby(by='ID')['Hour'].aggregate(lambda x: all_in(x,testVector))
to_keep = list(to_keep[to_keep].index)
df = df[df['ID'].isin(to_keep)]

I want to make this code as short and efficient as possible. Any suggestions for improvements or alternative solution approaches?

Upvotes: 2

Views: 1542

Answers (3)

jezrael

Reputation: 863166

First create a set of the Hour values for each ID. Then map those sets back onto the ID column to get a new Series, which is compared with the vector as a set:

df = df[df['ID'].map(df.groupby(by='ID')['Hour'].apply(set)) >= set(testVector)]
print (df)
   Hour ID
0     0  A
1     2  A
2     5  A
3     6  A
4     9  A
5     0  B
6     2  B
7     5  B
8     6  B

Timings:

np.random.seed(123)
N = 1000000

df = pd.DataFrame({'ID': np.random.randint(200, size=N),
                   'Hour': np.random.choice(range(10000),N)})
print (df)
testVector = [0,2,5] 

test_set = set(testVector)
s = pd.Series(testVector)

#maxu sol
In [259]: %timeit (df.loc[df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)])
1 loop, best of 3: 356 ms per loop

#jez sol
In [260]: %timeit (df[df['ID'].map(df.groupby(by='ID')['Hour'].apply(set)) >= set(testVector)])
1 loop, best of 3: 462 ms per loop

#ayhan sol1
In [261]: %timeit (df[df.groupby('ID')['Hour'].transform(lambda x: s.isin(x).all())])
1 loop, best of 3: 300 ms per loop

#ayhan sol2
In [263]: %timeit (df.groupby('ID').filter(lambda x: s.isin(x['Hour']).all()))
1 loop, best of 3: 211 ms per loop

Upvotes: 2

user2285236

Reputation:

Similar to MaxU's solution but I used a Series instead of a set:

testVector = pd.Series(testVector)
df[df.groupby('ID')['Hour'].transform(lambda x: testVector.isin(x).all())]
Out: 
   Hour ID
0     0  A
1     2  A
2     5  A
3     6  A
4     9  A
5     0  B
6     2  B
7     5  B
8     6  B

filter might be more idiomatic here, though:

df.groupby('ID').filter(lambda x: testVector.isin(x['Hour']).all())
Out: 
   Hour ID
0     0  A
1     2  A
2     5  A
3     6  A
4     9  A
5     0  B
6     2  B
7     5  B
8     6  B
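A quick sanity check that the two variants agree: this sketch rebuilds the question's sample data and compares the transform-mask approach with the filter approach directly.

```python
import numpy as np
import pandas as pd

# sample data from the question
ID = np.concatenate((np.repeat("A", 5), np.repeat("B", 4), np.repeat("C", 2)))
df = pd.DataFrame({'ID': ID,
                   'Hour': [0, 2, 5, 6, 9, 0, 2, 5, 6, 0, 2]})
testVector = pd.Series([0, 2, 5])

# transform builds a row-level boolean mask for indexing ...
via_transform = df[df.groupby('ID')['Hour']
                     .transform(lambda x: testVector.isin(x).all())]

# ... while filter drops the failing groups directly
via_filter = df.groupby('ID').filter(lambda x: testVector.isin(x['Hour']).all())
```

Both keep the original index of the surviving rows, so the results are identical frames.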

Upvotes: 2

MaxU - stand with Ukraine

Reputation: 210882

In [99]: test_set = set(testVector)

In [100]: df.loc[df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)]
Out[100]:
   Hour ID
0     0  A
1     2  A
2     5  A
3     6  A
4     9  A
5     0  B
6     2  B
7     5  B
8     6  B

Explanation:

in the lambda x: set(x) & test_set == test_set function we create a set of Hour values for each group:

In [104]: df.groupby('ID').Hour.apply(lambda x: set(x))
Out[104]:
ID
A    {0, 2, 5, 6, 9}
B       {0, 2, 5, 6}
C             {0, 2}
Name: Hour, dtype: object

Then we do set intersection with the test_set:

In [105]: df.groupby('ID').Hour.apply(lambda x: set(x) & test_set)
Out[105]:
ID
A    {0, 2, 5}
B    {0, 2, 5}
C       {0, 2}
Name: Hour, dtype: object

and compare it with the test_set again:

In [106]: df.groupby('ID').Hour.apply(lambda x: set(x) & test_set == test_set)
Out[106]:
ID
A     True
B     True
C    False
Name: Hour, dtype: bool

PS: I used .apply() instead of .transform() just to show how it works.

But we need to use transform in order to use boolean indexing later on:

In [107]: df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)
Out[107]:
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9     False
10    False
Name: Hour, dtype: bool
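The apply/transform distinction above comes down to shape: apply yields one boolean per group, transform broadcasts that boolean back to every row of the group, which is the shape boolean indexing needs. A minimal sketch on the question's sample data (per_group and per_row are illustrative names):

```python
import numpy as np
import pandas as pd

# sample data from the question
ID = np.concatenate((np.repeat("A", 5), np.repeat("B", 4), np.repeat("C", 2)))
df = pd.DataFrame({'ID': ID,
                   'Hour': [0, 2, 5, 6, 9, 0, 2, 5, 6, 0, 2]})
test_set = {0, 2, 5}

# apply: one boolean per group, indexed by ID (3 values)
per_group = df.groupby('ID').Hour.apply(lambda x: set(x) & test_set == test_set)

# transform: the group result broadcast back to every row (11 values),
# aligned with df's index, so it can be used as a mask
per_row = df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)

print(len(per_group), len(per_row))  # 3 vs 11
filtered = df.loc[per_row]
```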

Upvotes: 3
