Reputation: 179
I have a DataFrame, df, with 10 million rows. I am running the loop below, which takes a long time to execute. Is there a faster way to do the same task?
for i in range(len(df)):
    if df['col_1'][i] not in ['a', 'b']:
        df.at[i, 'col_1'] = np.nan
Upvotes: 1
Views: 54
Reputation: 862406
For better performance use numpy.where with the values converted to a 1d array by Series.values:
df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),
                       df['col_1'].values,
                       np.nan)

#pandas 0.24+
df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),
                       df['col_1'].to_numpy(),
                       np.nan)
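As a quick sanity check, here is what this does on a tiny hypothetical frame (the values are made up; only 'a' and 'b' are kept):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'x', 'b', 'y', 'a']})

# rows whose value is not 'a' or 'b' become NaN
df['col_1'] = np.where(df['col_1'].isin(['a', 'b']),
                       df['col_1'].to_numpy(),
                       np.nan)

print(df)
#   col_1
# 0     a
# 1   NaN
# 2     b
# 3   NaN
# 4     a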
Test for 1% of a,b values:
np.random.seed(2019)
N = 10 ** 7
df = pd.DataFrame({'col_1':np.random.choice(['a','b','c'], p=(.05,.05,.9),size=N)})
#print (df)
In [87]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']), df['col_1'].values, np.nan)
425 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [88]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']), df['col_1'], np.nan)
442 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [89]: %timeit df.loc[~df['col_1'].isin(['a', 'b']), 'col_1'] = np.nan
537 ms ± 4.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test for 50% of a,b values:
np.random.seed(2019)
N = 10 ** 7
df = pd.DataFrame({'col_1':np.random.choice(['a','b','c'], p=(.25,.25,.5),size=N)})
print (df)
In [101]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']), df['col_1'].values, np.nan)
532 ms ± 3.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [102]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']), df['col_1'], np.nan)
533 ms ± 4.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [103]: %timeit df.loc[~df['col_1'].isin(['a', 'b']), 'col_1'] = np.nan
602 ms ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test for 90% of a,b values:
np.random.seed(2019)
N = 10 ** 7
df = pd.DataFrame({'col_1':np.random.choice(['a','b','c'], p=(.45,.45,.1),size=N)})
print (df)
In [106]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']), df['col_1'].values, np.nan)
517 ms ± 3.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [107]: %timeit df['col_1'] = np.where(df['col_1'].isin(['a', 'b']), df['col_1'], np.nan)
520 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [108]: %timeit df.loc[~df['col_1'].isin(['a', 'b']), 'col_1'] = np.nan
557 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
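If you want to reproduce these timings outside IPython, here is a minimal sketch with the standard-library timeit module (the setup string rebuilds the frame before every measurement, since the assignment mutates it):

import timeit

setup = """
import numpy as np
import pandas as pd
np.random.seed(2019)
N = 10 ** 7
df = pd.DataFrame({'col_1': np.random.choice(['a','b','c'], p=(.45,.45,.1), size=N)})
"""

stmt = "df['col_1'] = np.where(df['col_1'].isin(['a', 'b']), df['col_1'].to_numpy(), np.nan)"

# each repeat re-runs the setup, so every measurement starts from a fresh frame
print(timeit.repeat(stmt, setup=setup, number=1, repeat=3))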
Upvotes: 1
Reputation: 2022
Try this:
df.loc[~df['col_1'].isin(['a', 'b']), 'col_1'] = np.nan
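A quick illustration on a tiny hypothetical frame of what this boolean-mask assignment does (only 'a' and 'b' are kept; everything else is overwritten in place):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'x', 'b', 'y', 'a']})

# ~isin(...) selects the rows to overwrite with NaN
df.loc[~df['col_1'].isin(['a', 'b']), 'col_1'] = np.nan

print(df)
#   col_1
# 0     a
# 1   NaN
# 2     b
# 3   NaN
# 4     a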
Upvotes: 2