Reputation: 25395
I am trying to solve this question with Python.
ID = np.concatenate((np.repeat("A", 5),
                     np.repeat("B", 4),
                     np.repeat("C", 2)))
Hour = np.array([0,2,5,6,9,0,2,5,6,0,2])
testVector = [0,2,5]
df = pd.DataFrame({'ID' : ID, 'Hour': Hour})
We group the rows by ID, then we want to remove all rows from df where not all values in testVector are found in the Hour column of that group. We could achieve that as follows:
def all_in(x, y):
    return all([z in list(x) for z in y])
to_keep = df.groupby(by='ID')['Hour'].aggregate(lambda x: all_in(x,testVector))
to_keep = list(to_keep[to_keep].index)
df = df[df['ID'].isin(to_keep)]
I want to make this code as short and efficient as possible. Any suggestions for improvements or alternative solution approaches?
Upvotes: 2
Views: 1542
Reputation: 863166
First create a set for each ID from the Hour column. Then map it onto the ID column for a new Series, which is compared with the vector:
df = df[df['ID'].map(df.groupby(by='ID')['Hour'].apply(set)) >= set(testVector)]
print (df)
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
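For clarity, the one-liner can be unpacked into its two steps on the sample data: build one set of Hour values per ID group, map each row's ID to its group's set, and compare that Series of sets against set(testVector) with the >= (superset) operator:

```python
import numpy as np
import pandas as pd

ID = np.concatenate((np.repeat("A", 5), np.repeat("B", 4), np.repeat("C", 2)))
Hour = np.array([0, 2, 5, 6, 9, 0, 2, 5, 6, 0, 2])
df = pd.DataFrame({'ID': ID, 'Hour': Hour})
testVector = [0, 2, 5]

# Step 1: one set of Hour values per ID group
sets = df.groupby(by='ID')['Hour'].apply(set)
print(sets)

# Step 2: map each row's ID to its group's set, then test
# "group set is a superset of testVector" row by row
mask = df['ID'].map(sets) >= set(testVector)
print(mask.tolist())
```

Rows for IDs A and B are kept (their groups contain all of 0, 2 and 5); the two C rows are dropped.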
Timings:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'ID': np.random.randint(200, size=N),
                   'Hour': np.random.choice(range(10000), N)})
print (df)
testVector = [0,2,5]
test_set = set(testVector)
s = pd.Series(testVector)
#maxu sol
In [259]: %timeit (df.loc[df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)])
1 loop, best of 3: 356 ms per loop
#jez sol
In [260]: %timeit (df[df['ID'].map(df.groupby(by='ID')['Hour'].apply(set)) >= set(testVector)])
1 loop, best of 3: 462 ms per loop
#ayhan sol1
In [261]: %timeit (df[df.groupby('ID')['Hour'].transform(lambda x: s.isin(x).all())])
1 loop, best of 3: 300 ms per loop
#ayhan sol2
In [263]: %timeit (df.groupby('ID').filter(lambda x: s.isin(x['Hour']).all()))
1 loop, best of 3: 211 ms per loop
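The In[...] lines above use IPython's %timeit magic. Outside IPython, the same comparison can be sketched with the standard timeit module; absolute numbers will differ by machine and pandas version, and a smaller N is used here just to keep the demo quick:

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000  # the timings above used N = 1000000
df = pd.DataFrame({'ID': np.random.randint(200, size=N),
                   'Hour': np.random.choice(range(10000), N)})
testVector = [0, 2, 5]
test_set = set(testVector)
s = pd.Series(testVector)

# The four competing solutions from this thread
candidates = {
    'maxu':   lambda: df.loc[df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)],
    'jez':    lambda: df[df['ID'].map(df.groupby(by='ID')['Hour'].apply(set)) >= set(testVector)],
    'ayhan1': lambda: df[df.groupby('ID')['Hour'].transform(lambda x: s.isin(x).all())],
    'ayhan2': lambda: df.groupby('ID').filter(lambda x: s.isin(x['Hour']).all()),
}
for name, func in candidates.items():
    t = timeit.timeit(func, number=3)
    print(f'{name}: {t / 3:.3f} s per loop')
```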
Upvotes: 2
Similar to MaxU's solution but I used a Series instead of a set:
testVector = pd.Series(testVector)
df[df.groupby('ID')['Hour'].transform(lambda x: testVector.isin(x).all())]
Out:
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
Filter might be more idiomatic here though:
df.groupby('ID').filter(lambda x: testVector.isin(x['Hour']).all())
Out:
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
Upvotes: 2
Reputation: 210882
In [99]: test_set = set(testVector)
In [100]: df.loc[df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)]
Out[100]:
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
Explanation:
In the lambda x: set(x) & test_set == test_set function we create a set of Hour values for each group:
In [104]: df.groupby('ID').Hour.apply(lambda x: set(x))
Out[104]:
ID
A {0, 2, 5, 6, 9}
B {0, 2, 5, 6}
C {0, 2}
Name: Hour, dtype: object
Then we take the set intersection with test_set:
In [105]: df.groupby('ID').Hour.apply(lambda x: set(x) & test_set)
Out[105]:
ID
A {0, 2, 5}
B {0, 2, 5}
C {0, 2}
Name: Hour, dtype: object
and compare it with test_set again:
In [106]: df.groupby('ID').Hour.apply(lambda x: set(x) & test_set == test_set)
Out[106]:
ID
A True
B True
C False
Name: Hour, dtype: bool
PS I used .apply() instead of .transform() above just to show how it works, but we need transform in order to use boolean indexing afterwards:
In [107]: df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)
Out[107]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
10 False
Name: Hour, dtype: bool
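As a side note (my addition, not from the answer above): set(x) & test_set == test_set is logically the same as the subset test test_set.issubset(x), which states the intent a little more directly:

```python
import numpy as np
import pandas as pd

ID = np.concatenate((np.repeat("A", 5), np.repeat("B", 4), np.repeat("C", 2)))
Hour = np.array([0, 2, 5, 6, 9, 0, 2, 5, 6, 0, 2])
df = pd.DataFrame({'ID': ID, 'Hour': Hour})
test_set = {0, 2, 5}

# "test_set is a subset of this group's hours" -- equivalent to
# set(x) & test_set == test_set, but reads as the subset check it is
result = df.loc[df.groupby('ID').Hour.transform(lambda x: test_set.issubset(x))]
print(result)
```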
Upvotes: 3