Reputation: 999
Consider a python dataframe
A B C
1 random imp1
2 random imp2
5 random imp3
1 yes ---
2 yes ---
3 no ---
4 no ---
5 yes ---
Whenever column B has a value yes, I want to fetch values of A. And eventually for those values of A, I want C when those values occurred 1st in A. So in this case, I finally want imp1, imp2 and imp3.
Is there an elegant way of doing this.
Upvotes: 3
Views: 42
Reputation: 294258
This should be very fast
a = df.A.values
b = df.B.values == 'yes'
d = df.drop_duplicates('A')
d.C[np.in1d(d.A.values, a[b])]
0 imp1
1 imp2
2 imp3
Name: C, dtype: object
Over the top approach. About 50% faster than my other approach.
from numba import njit
@njit
def proc(f, m):
mx = f.max() + 1
a = [False] * mx
b = [0] * mx
z = [0] * f.size
for i in range(f.size):
x = f[i]
y = m[i]
b[x] += 1
z[i] = b[x]
a[x] = a[x] or y
return np.array(z) == 1, np.array(a)[f]
df.C[np.logical_and(*proc(pd.factorize(df.A.values)[0], df.B.values == 'yes'))]
0 imp1
1 imp2
2 imp3
Name: C, dtype: object
Upvotes: 1
Reputation: 862661
You can use boolean indexing
with loc
first, then duplicated
and last filter with isin
by values a
:
a = df.loc[df['B'] == 'yes', 'A']
df = df.drop_duplicates('A')
df = df.loc[df['A'].isin(a), 'C']
print (df)
0 imp1
1 imp2
2 imp3
Name: C, dtype: object
Timings:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'B': np.random.choice(['yes','no', 'a', 'b', 'c'], N),
'A':np.random.randint(1000, size=N),
'C':np.random.randint(1000, size=N)})
print (df)
print (df[df.A.isin(df[df.B == 'yes'].A)].drop_duplicates('A').C)
print (df[df.A.isin(df[df.B == 'yes'].drop_duplicates('A').A)].C)
def fjez(df):
a = df.loc[df['B'] == 'yes', 'A']
df = df.drop_duplicates('A')
return df.loc[df['A'].isin(a), 'C']
def fpir(df):
a = df.A.values
b = df.B.values == 'yes'
d = df.drop_duplicates('A')
return d.C[np.in1d(d.A.values, a[b])]
print (fjez(df))
print (fpir(df))
In [296]: %timeit (df[df.A.isin(df[df.B == 'yes'].A)].drop_duplicates('A').C)
1 loop, best of 3: 226 ms per loop
In [297]: %timeit (df[df.A.isin(df[df.B == 'yes'].drop_duplicates('A').A)].C)
1 loop, best of 3: 185 ms per loop
In [298]: %timeit (fjez(df))
10 loops, best of 3: 156 ms per loop
In [299]: %timeit (fpir(df))
10 loops, best of 3: 87.1 ms per loop
Upvotes: 2
Reputation: 153460
Let's use this one-liner:
df[df.A.isin(df[df.B == 'yes'].A)].drop_duplicates('A').C
Output:
0 imp1
1 imp2
2 imp3
Name: C, dtype: object
Upvotes: 2