Reputation: 3487
for example, I have a dataframe
with these 2 columns a
, and b
:
a = [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3]
b = [1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1]
I am expecting filtered dataframe: [5,6,7,2,3,4,9,0,1]
Without using the groupby
function (because it is take too long time with a very large dataframe
, it just not usable), how do I filter with the last 3 items from each group in col. a
?
Upvotes: 2
Views: 73
Reputation: 221664
Approach #1 : Here's a NumPy based approach -
In [89]: a = np.array([1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3])
...: b = np.array([1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1])
...:
In [90]: idx = np.append(np.nonzero(a[1:] > a[:-1])[0], a.size-1)[:,None] - [2,1,0]
In [91]: b[idx].ravel()
Out[91]: array([5, 6, 7, 2, 3, 4, 9, 0, 1])
If you are receiving those from the columns of a dataframe, df
named a
and b
respectively, as the pre-processing step, we need to extract those as arrays, like so -
a = df.a.values
b = df.b.values
Please note that this assumes at least three elements per group. For cases with lesser than 3
elems per group read on to the next approach.
Approach #2 : With Scipy's binary dilation
to create a mask for selecting elements off b
-
from scipy.ndimage.morphology import binary_dilation as imdilate
def filter_lastN(a, b, N):
mask = np.zeros(a.size,dtype=bool)
mask[np.append(np.nonzero(a[1:] > a[:-1])[0],b.size-1)] = 1
return b[imdilate(mask,np.ones(N),origin=(N-1)//2)]
Sample run -
In [198]: a
Out[198]: array([1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3])
In [199]: b
Out[199]: array([5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1])
In [200]: filter_lastN(a,b,3)
Out[200]: array([5, 6, 7, 2, 3, 4, 9, 0, 1])
In [201]: filter_lastN(a,b,5)
Out[201]: array([5, 6, 7, 0, 1, 2, 3, 4, 7, 8, 9, 0, 1])
Upvotes: 2
Reputation: 863301
You can use drop_duplicates
first for last rows of groups
, then get previous index values and last select by loc
:
a = [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3]
b = [1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1]
df = pd.DataFrame({'a': a, 'b':b})
print (df)
a b
0 1 1
1 1 2
2 1 3
3 1 4
4 1 5
5 1 6
6 1 7
7 2 8
8 2 9
9 2 0
10 2 1
11 2 2
12 2 3
13 2 4
14 3 5
15 3 6
16 3 7
17 3 8
18 3 9
19 3 0
20 3 1
df1 = df.drop_duplicates('a',keep='last')
print (df1)
a b
6 1 7
13 2 4
20 3 1
idx = sorted(df1.index.tolist() + (df1.index - 1).tolist() + (df1.index - 2).tolist())
print (idx)
[4, 5, 6, 11, 12, 13, 18, 19, 20]
print (df.loc[idx])
a b
4 1 5
5 1 6
6 1 7
11 2 2
12 2 3
13 2 4
18 3 9
19 3 0
20 3 1
Upvotes: 2