Reputation: 9869
Suppose I have a structured dataframe as follows:
df = pd.DataFrame({"A":['a','a','a','b','b'],
"B":[1]*5})
The A
column has previously been sorted. I wish to find the first row index of where df[df.A!='a']
. The end goal is to use this index to break the data frame into groups based on A
.
Now I realise that there is a groupby functionality. However, the dataframe is quite large and this is a simplified toy example. Since A
has been sorted already, it would be faster if I can just find the 1st index of where df.A!='a'
. Therefore it is important that whatever method that you use the scanning stops once the first element is found.
Upvotes: 80
Views: 174523
Reputation: 5191
Generalized Form:
index = df.loc[df.column_name == 'value_you_looking_for'].index[0]
Example:
index_of_interest = df.loc[df.A == 'a'].index[0]
Upvotes: 2
Reputation: 13582
Using pandas groupby()
to group by column or list of columns. Then first()
to get the first value in each group.
import pandas as pd
df = pd.DataFrame({"A":['a','a','a','b','b'],
"B":[1]*5})
#Group df by column and get the first value in each group
grouped_df = df.groupby("A").first()
#Reset indices to match format
first_values = grouped_df.reset_index()
print(first_values)
>>> A B
0 a 1
1 b 1
Upvotes: 13
Reputation: 776
You can iterate by dataframe rows (it is slow) and create your own logic to get values that you wanted:
def getMaxIndex(df, col)
max = -999999
rtn_index = 0
for index, row in df.iterrows():
if row[col] > max:
max = row[col]
rtn_index = index
return rtn_index
Upvotes: 0
Reputation: 38415
If you just want to find the first instance without going through the entire dataframe, you can go the for-loop way.
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
for index in range(len(df['A'])):
if df['A'][index] != 'a':
print(index)
break
The index is the row number of the 1st index of where df.A!='a'
Upvotes: 1
Reputation: 5273
For multiple conditions:
Let's say we have:
s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
And we want to find the first item different than a and c, we do:
n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
Times:
import numpy as np
import pandas as pd
from datetime import datetime
ITERS = 1000
def pandas_multi_condition(s):
ts = datetime.now()
for i in range(ITERS):
n = s[(s != 'a') & (s != 'c')].index[0]
print(n)
print(datetime.now() - ts)
def numpy_bitwise_and(s):
ts = datetime.now()
for i in range(ITERS):
n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
print(n)
print(datetime.now() - ts)
s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
print('pandas_multi_condition():')
pandas_multi_condition(s)
print()
print('numpy_bitwise_and():')
numpy_bitwise_and(s)
Output:
pandas_multi_condition():
4
0:00:01.144767
numpy_bitwise_and():
4
0:00:00.019013
Upvotes: 2
Reputation: 1530
I found there is first_valid_index function for Pandas DataFrames that will do the job, one could use it as follows:
df[df.A!='a'].first_valid_index()
3
However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:
df.loc[df.A!='a','A'].index[0]
Below I compare the total time(sec) of repeating calculations 100 times for these two options and all the codes above:
total_time_sec ratio wrt fastest algo
searchsorted numpy: 0.0007 1.00
argmax numpy: 0.0009 1.29
for loop: 0.0045 6.43
searchsorted pandas: 0.0075 10.71
idxmax pandas: 0.0267 38.14
index[0]: 0.0295 42.14
first_valid_index pandas: 0.1181 168.71
Notice numpy's searchsorted is the winner and first_valid_index shows worst performance. Generally, numpy algorithms are faster, and the for loop does not do so bad, but it's just because the dataframe has very few entries.
For a dataframe with 10,000 entries where the desired entries are closer to the end the results are different, with searchsorted delivering the best performance:
total_time_sec ratio wrt fastest algo
searchsorted numpy: 0.0007 1.00
searchsorted pandas: 0.0076 10.86
argmax numpy: 0.0117 16.71
index[0]: 0.0815 116.43
idxmax pandas: 0.0904 129.14
first_valid_index pandas: 0.1691 241.57
for loop: 9.6504 13786.29
The code to produce these results is below:
import timeit
# code snippet to be executed only once
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''
# code snippets whose execution time is to be measured
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]
mycode_set.append( '''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")
mycode_set.append( '''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")
mycode_set.append( '''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")
mycode_set.append( '''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")
mycode_set.append( '''df.A.values.searchsorted('a', side='right')''' )
message.append("searchsorted numpy: ")
mycode_set.append( '''for index in range(len(df['A'])):
if df['A'][index] != 'a':
ans = index
break
''')
message.append("for loop: ")
total_time_in_sec = []
for i in range(len(mycode_set)):
mycode = mycode_set[i]
total_time_in_sec.append(np.round(timeit.timeit(setup = mysetup,\
stmt = mycode, number = 100),4))
output = pd.DataFrame(total_time_in_sec, index = message, \
columns = ['total_time_sec' ])
output["ratio wrt fastest algo"] = \
np.round(output.total_time_sec/output["total_time_sec"].min(),2)
output = output.sort_values(by = "total_time_sec")
display(output)
For the larger dataframe:
mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''
Upvotes: 45
Reputation: 294338
idxmax
and argmax
will return the position of the maximal value or the first position if the maximal value occurs more than once.
use idxmax
on df.A.ne('a')
df.A.ne('a').idxmax()
3
or the numpy
equivalent
(df.A.values != 'a').argmax()
3
However, if A
has already been sorted, then we can use searchsorted
df.A.searchsorted('a', side='right')
array([3])
Or the numpy
equivalent
df.A.values.searchsorted('a', side='right')
3
Upvotes: 83