Reputation: 369
Say I have a pandas dataframe that looks like this:
color number
0 red 3
1 blue 4
2 green 2
3 blue 2
I want to get the first value from the number column where the color column has the value 'blue'
which in this case would return 4
.
I know this can be done using loc
in something like this:
df[df['color'] == 'blue']['number'][0]
I'm wondering if there is any more optimal approach given that I only ever need the first occurrence.
Upvotes: 2
Views: 9664
Reputation: 164623
idx = next(iter(df['number'].iloc[np.where(df['color'].eq('blue'))]), -1) # 4
Note this also handles the case where the colour does not exist. In comparison, df['color'].eq('orange').idxmax()
gives 0
even though 'orange'
does not exist in the series. The above logic will give -1
.
numba
I'm wondering if there is any more optimal approach given that I only ever need the first occurrence.
Yes! For a more efficient solution, see Efficiently return the index of the first value satisfying condition in array. Numba allows you to iterate row-wise efficiently. In this case, you will need to factorize your strings first so that you feed numeric arrays only to Numba:
from numba import njit
# factorize series, pd.factorize maintains order,
# i.e. first item in values gives 0 index
idx, values = pd.factorize(df['color'])
idx_search = np.where(values == 'blue')[0][0]
@njit
def get_first_index_nb(A, k):
for i in range(len(A)):
if A[i] == k:
return i
return -1
res = df['number'].iat[get_first_index_nb(idx, 1)] # 4
Of course, for a one-off calculation, this is inefficient. But for successive calculations, the solution will likely be a factor faster than solutions which check for equality across the entire series / array.
Upvotes: 1
Reputation: 402333
Use head
—this will return the first row if the color exists, and an empty Series
otherwise.
col = 'blue'
df.query('color == @col').head(1).loc[:, 'number']
1 4
Name: number, dtype: int64
Alternatively, to get a single item, use obj.is_empty
:
u = df.query('color == @col').head(1)
if not u.is_empty:
print(u.at[u.index[0], 'number'])
# 4
Difference between head
and idxmax
for invalid color:
df.query('color == "blabla"').head(1).loc[:, 'number']
# Series([], Name: number, dtype: int64)
df.loc[(df['color'] == 'blabla').idxmax(),'number']
# 3
Upvotes: 3
Reputation: 323226
Using idxmax
df.loc[(df['color'] == 'blue').idxmax(),'number']
Out[698]: 4
Upvotes: 2